arXiv:1508.03787v1 [cs.IT] 16 Aug 2015


Information-theoretically Secure Erasure Codes 

for Distributed Storage 

Nihar B. Shah, K. V. Rashmi, Kannan Ramchandran, Fellow, IEEE, and P. Vijay Kumar, Fellow, IEEE


Abstract 

Repair operations in distributed storage systems potentially expose the data to malicious acts of passive 
eavesdroppers or active adversaries, which can be detrimental to the security of the system. This paper presents 
erasure codes and repair algorithms that ensure security of the data in the presence of passive eavesdroppers and 
active adversaries, while maintaining high availability, reliability and efficiency in the system. Our codes are 
optimal in that they meet previously proposed lower bounds on the storage, network-bandwidth, and reliability 
requirements for a wide range of system parameters. Our results thus establish the capacity of such systems. 
Our codes for security from active adversaries provide an additional appealing feature of ‘on-demand security’ 
where the desired level of security can be chosen separately for each instance of repair, and our algorithms 
remain optimal simultaneously for all possible levels. The paper also provides necessary and sufficient conditions 
governing the transformation of any (non-secure) code into one providing on-demand security. 


I. Introduction 

The amount of data stored in large-scale, distributed storage-systems such as data centers is increasing 
exponentially. In order to scale massively at low costs, data-centers employ inexpensive commodity hardware. 
As these components are prone to failure [3], [4], the system must possess enough redundancy to ensure that
data remains reliable and available in the face of these failures. One means of introducing redundancy is via 
replication. However, replication is inefficient with respect to storage space utilization, and thus in order to 
scale economically, data centers are increasingly turning to the use of erasure coding as a far more efficient 
option [5], [6].

Consider a distributed storage system with n storage nodes across which some data (termed the message)
is to be stored. Each of these n nodes stores only a fraction of the data. In order to provide reliability and
availability, the erasure codes considered in this paper ensure that a user (termed a data collector) is able
to recover the message from the data stored in any k of the n nodes. This property is called data reconstruction
or simply reconstruction. The reconstruction property gives the storage system the capability of tolerating
failures of any (n - k) of the n nodes.

Upon failure of a storage node, a replacement node is designated to store the data that was stored previously 
in the failed node. The replacement node recovers this data by downloading (a part of) the data stored in the 
remaining nodes. This is termed a node-repair or simply a repair operation. Traditional erasure codes typically
handle repair by first downloading and reconstructing the entire message and then extracting the required data
from it. Such an operation is quite wasteful of network resources, and several recent works [7]-[28] propose
new erasure codes and repair algorithms addressing this issue.

Fig. 1: Compromise of security when using a Reed-Solomon code due to (a) passive eavesdroppers, and (b)
active adversaries.

Security is an important aspect of distributed storage systems, and this is underscored by the many incidents 
of data compromise in the recent past (e.g., [29]-[31]). This paper considers the problem of designing codes
and algorithms for distributed storage that ensure security in addition to maintaining other properties such as 
reliability, availability and efficiency. We consider the information-theoretic notion of security (and not the 
computational notion), making no assumption about the computational power of the adversary. 


In designing a secure distributed storage system, special attention must be given to the repair operations 
since they can pose security hazards. For instance, in a system using traditional erasure codes, the repair of a 
failed node would require the download of the entire data at the replacement node. An eavesdropper tapping 
onto the replacement node can thus obtain the entire data. Alternatively, during repair of a failed node, an 
active adversary that has captured one or more of the remaining nodes may pass corrupt data during the repair 
operation. Such an action would corrupt the data stored in the replacement node, and these errors may then 
propagate across the entire system during subsequent repairs of other nodes. This makes repair operations a 
serious security threat, and motivates the goal of this paper. 


Fig. 1 illustrates the problem of security under repair dynamics using a toy example of a Reed-Solomon code
with n = 4 and k = 2. The code stores the message {a, b} across four nodes, where both a and b belong to
the finite field F_5 in this example. It is easy to verify that the entire message can be recovered by downloading
the data stored in any k = 2 of the four nodes. On failure of any node, the replacement node connects to
any two other nodes and downloads all the data stored in them, from which it recovers the data
stored in the node prior to failure. Fig. 1a illustrates a setting with passive eavesdroppers. Specifically, consider an
eavesdropper who can read all the data stored in node 1. Under the Reed-Solomon code, the eavesdropper
gains access to the symbol a. Furthermore, if the eavesdropper is also able to listen to the data passed for any
repair operation, then it gains access to the entire message. Fig. 1b illustrates a setting with active adversaries.
Suppose an active adversary gains access to one of the nodes, say node 3, and suppose the repair of node 1
is performed by connecting to nodes 2 and 3. Then, as seen in the figure, the adversary can pass malicious
data in the repair process (for instance, (a + b + 1) instead of (a + b)), making the replacement node store (a + 1)
while believing it is actually storing a. This error, in turn, propagates during further repairs, and also sabotages all
subsequent reconstruction operations involving node 1.
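
The toy example can be replayed numerically. The following is a minimal sketch under stated assumptions: the text confirms only that node 1 stores a and that node 3 passes (a + b); the contents assumed here for node 2 (the symbol b) and node 4 (the symbol a + 2b), as well as all function names, are hypothetical choices that are consistent with the description but are not taken from the figure.

```python
Q = 5  # all arithmetic is modulo 5 (the field F_5)

# Assumed layout: node i stores ca * a + cb * b for the pair (ca, cb) below.
# Only nodes 1 and 3 are confirmed by the text; nodes 2 and 4 are assumptions.
COEFFS = {1: (1, 0), 2: (0, 1), 3: (1, 1), 4: (1, 2)}

def encode(a, b):
    """Return the symbol stored in each of the four nodes."""
    return {i: (ca * a + cb * b) % Q for i, (ca, cb) in COEFFS.items()}

def reconstruct(i, yi, j, yj):
    """Recover (a, b) from the symbols yi, yj of two distinct nodes i, j."""
    (p, q), (r, s) = COEFFS[i], COEFFS[j]
    det_inv = pow((p * s - q * r) % Q, -1, Q)  # modular inverse (Python 3.8+)
    a = (s * yi - q * yj) * det_inv % Q        # Cramer's rule over F_5
    b = (p * yj - r * yi) * det_inv % Q
    return a, b

def repair_node1(from_node2, from_node3):
    """Repair of node 1 from nodes 2 and 3: a = (a + b) - b."""
    return (from_node3 - from_node2) % Q

nodes = encode(3, 4)

# Passive eavesdropper on node 1: the repair download (nodes 2 and 3) is
# already enough to reconstruct the entire message.
leaked = reconstruct(2, nodes[2], 3, nodes[3])

# Active adversary at node 3 sends (a + b + 1) instead of (a + b).
honest = repair_node1(nodes[2], nodes[3])
corrupted = repair_node1(nodes[2], (nodes[3] + 1) % Q)
print(leaked, honest, corrupted)
```

With a = 3 and b = 4, the corrupted repair leaves node 1 holding a + 1 = 4, which is exactly the error propagation described above.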

In this paper, we model the distributed storage system based on the regenerating codes model introduced by
Dimakis et al. [32]. In addition to the parameters n and k introduced earlier, this model has a third parameter
d. Upon failure of a node, the replacement node may connect to any d of the remaining nodes, and should be
able to recover the data that was stored previously in the failed node by downloading a minimal amount of data
from these d nodes. It was shown in [32] that there is a tradeoff between the total amount of data stored per
node and that downloaded for repair of a failed node, and this is described in greater detail in Section II-C. The
two extreme regimes in the tradeoff are the minimum storage regenerating (MSR) and the minimum bandwidth
regenerating (MBR) regimes. The MSR regime aims to minimize the amount of storage required per node, and
for this amount of storage, minimizes the amount of download. On the other hand, the MBR regime aims to
keep the amount of download to an absolute minimum, and for this amount of download, minimizes the amount
of storage.


The initial work [32] on regenerating codes considers the repair to be ‘functional’, wherein a replacement
node need not be identical to the failed node, but should only satisfy further reconstructions and repairs. A 
strictly stronger and practical requirement is that of ‘exact’ repair, where the replacement node must obtain and 
store data identical to that in the failed node. Throughout this paper, we consider only exact repair. 


The threat model we consider for security in this paper is an extension of the threat model proposed in [33]
for the regenerating codes setup. Two classes of threats are considered: 


• Security from passive eavesdroppers: This threat class involves preventing leakage of any information about 
the data to passive eavesdroppers who may gain access to a subset of the storage nodes. By ‘passive’ we 
mean that the eavesdropper may read and store any data it gains access to, but does not corrupt any data. 
In our threat model, two parameters ℓ and m determine the level of security to be provided: the goal is to
ensure that a passive eavesdropper having access to the data stored in any ℓ of the nodes, and additionally
to the data passed for repair in any m of these ℓ nodes, gains zero information about the message.

• Security from active (malicious) adversaries: This threat class involves ensuring that the presence of an 
adversary who gains access to a subset of the storage nodes and wishes to actively corrupt the data does 
not affect the operations in the system. The active adversary may corrupt the data stored in a subset of 
storage nodes and also pass erroneous data to other nodes during their repair operations. In particular, for 
some parameter p determining the level of security to be provided, the goal is to guard against p nodes 
being corrupted by an active adversary. In such a scenario, if a replacement node (or a data-collector) 
connects to any of the p corrupt nodes, then these corrupt nodes could pass arbitrary data in the repair 
(or data-reconstruction) operation. The goal here is to successfully accomplish the node-repair and data- 
reconstruction operations even in the presence of such an attack. 


In this paper, we present explicit codes for distributed storage at MSR and MBR points that ensure security of 
the data from passive eavesdroppers and/or active adversaries. The secure codes at the MBR point are applicable 
for all values of the parameters [n, k, d], and the secure codes at the MSR point are applicable for all values
of the parameters satisfying d ≥ 2k - 2. The secure codes presented have two attractive features:


• Optimality: Our codes for security from active adversaries and from passive eavesdroppers at the MBR
point are optimal for all values of the parameters. The codes for security from active adversaries at the
MSR point are optimal for all parameters satisfying d ≥ (2k - 2). The codes for security from passive
eavesdroppers at the MSR point are optimal for all parameters satisfying d ≥ (2k - 2) with m ≤ 1. The
codes presented in this paper thus establish the secure capacity of such systems.

• On-demand security from active adversaries: Under our codes, when dealing with active adversaries, 
the protection level p need not be fixed a priori, but can be chosen flexibly at “run-time” when any
repair or reconstruction operation is being executed. The protection level p can be chosen separately and 
independently for each instance of repair or reconstruction. This “on-demand” security endows the system 
with the advantage of not having to preset the security level and associated system resources for the worst 
case. This is in contrast to the traditional models of information-theoretic security (e.g., [33]) that take a 
“static” approach towards system design where the price is paid, in terms of the reduction in the size of the 
message stored, corresponding to the magnitude of the “worst-case” security level required. It turns out, 
perhaps surprisingly, that our codes do not require any additional storage to support this on-demand
property, and are optimal for all values of p. 


The problem at hand is closely related to the problem of secure network coding [34]-[42]. The literature
on secure network coding primarily considers a multicast setting where one or more receivers wish to obtain
the entire data that was transmitted. The threat models typically consider the scenario of link compromise in
the networks. The problem we consider, on the other hand, falls into the harder non-multicast setting [34], and
furthermore, needs to handle the harder case of adversaries or eavesdroppers being able to capture nodes in the 
network [40]. The results of this paper thus establish the capacity of a class of non-multicast networks in the 
presence of active adversaries or passive eavesdroppers having the ability to compromise nodes. Interestingly, 
the capacity-achieving codes that we propose in this paper are linear, deterministic and explicit. 

While the primary focus of this paper is information-theoretic security, the results of this paper are
also relevant to other applications such as correcting network errors and erasures and reducing the latency of
“degraded reads” in data centers. These applications are discussed in more detail in Section II-D.

The rest of the paper is organized as follows. Section II formalizes the system model, and describes our
approach towards the code construction problem. Some additional applications of our codes are also discussed in
this section. Section III describes related literature. The codes presented in this paper build upon our previously
proposed product-matrix framework [11], and this framework is overviewed in Section IV. Sections V and
VI present explicit constructions of MBR codes that provide security from active adversaries and from
passive eavesdroppers respectively. Sections VII and VIII present explicit constructions of MSR codes
that provide security from active adversaries and from passive eavesdroppers respectively. Section IX provides
necessary and sufficient conditions governing the transformation of any (non-secure) erasure code into one
providing on-demand security. Section X presents a concluding discussion.


II. System Model, Our Approach, and Summary of the Results 

We will first describe the system model in the absence of security requirements (the ‘regenerating codes’ 
model), following which we describe the extension that incorporates the provision of security. Alongside, we 
also describe our approach towards constructing secure distributed storage codes, and a summary of the results 
of this paper. 


A. Regenerating Codes Model (in the absence of security requirements) 

Under the regenerating codes model [32], the system comprises n storage nodes, across which data
comprising B symbols is to be stored. This set of B symbols is called the message, and each of these
symbols is assumed to belong to a finite field of size q. Each of the n storage nodes has a capacity of
storing α symbols. The data is to be stored such that a user (termed data-collector) can recover the entire
message by downloading the data stored in any k of these n nodes. This process is termed data-reconstruction,
or simply reconstruction. It follows that any system satisfying the reconstruction property can tolerate the failure
of any (n - k) storage nodes without losing any data.

Now let us consider the repair operation. When a storage node fails, it is replaced by another node, called the 
replacement node, that stores exactly the same data as the failed node. The regenerating codes model contains 
two additional parameters, d and β, that are associated with the repair of failed nodes. The replacement node is
permitted to connect to any d (≥ k) nodes out of the remaining (n - 1) nodes while downloading β (≤ α)
symbols from each node. This set of d nodes is termed the helper nodes for this instance of repair. From the
set of dβ symbols thus obtained, the replacement node must recover the α symbols that were stored in the
failed node. The total amount dβ of data downloaded for repair purposes is termed the repair bandwidth.

It is shown in [32] that the parameters associated with a regenerating code must necessarily satisfy the bound

\[ B \le \sum_{i=0}^{k-1} \min\bigl(\alpha, (d-i)\beta\bigr). \tag{1} \]

Since both storage and bandwidth come at a cost, it is naturally desirable to minimize both α as well as dβ,
and try to achieve the bound (1) with equality, i.e.,

\[ B = \sum_{i=0}^{k-1} \min\bigl(\alpha, (d-i)\beta\bigr). \tag{2} \]

It can be deduced (see [32]) that achieving (2), for fixed values of B and [n, k, d], leads to a tradeoff between
the storage space α and the repair-bandwidth dβ.¹


Two important regimes of this tradeoff are its extremities, termed the minimum storage regenerating (MSR) 
and minimum bandwidth regenerating (MBR) regimes, described below. 


Minimum Bandwidth Regenerating (MBR) regime: The MBR regime entails the minimum possible repair-bandwidth:
the amount of data downloaded by a replacement node is no more than the amount of data that
was stored in the failed node. To arrive at the MBR regime, one must first minimize the bandwidth dβ in the
tradeoff (2), and then obtain the minimum storage α for this value of repair-bandwidth. Clearly, a replacement
node must download at least as much data as what was stored in the failed node. Moreover, it is shown in [43]
that for optimal performance, every storage node must utilize its full storage capacity of α symbols. Thus, the
minimum bandwidth required is dβ = α. To satisfy (2) with this value of dβ, the parameters associated with the
MBR regime must satisfy

\[ B = \left( kd - \binom{k}{2} \right)\beta, \qquad d\beta = \alpha. \tag{3} \]


Minimum Storage Regenerating (MSR) regime: The MSR regime allows for the minimum possible storage
capacity: the requirement of being able to recover all B message symbols from any k nodes mandates the
storage per node to be at least B/k, and the MSR regime requires attainment of this minimum, i.e., has α = B/k.
Substituting this value in (2), it follows that the parameters associated with the MSR regime are

\[ \alpha = \frac{B}{k}, \qquad d\beta = \alpha + (k-1)\beta. \tag{4} \]

Any code operating in the MSR regime is thus Maximum-Distance-Separable (MDS). 
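
The two operating points can be checked numerically against the bound (1). The sketch below is illustrative only; the function names are hypothetical, and the MSR storage α = (d - k + 1)β used in the code is simply (4) solved for α.

```python
from math import comb

def cutset_bound(k, d, alpha, beta):
    """Right-hand side of bound (1): sum over i = 0..k-1 of min(alpha, (d-i)*beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))

def mbr_point(k, d, beta=1):
    """MBR regime (3): alpha = d*beta and B = (k*d - C(k,2))*beta."""
    alpha = d * beta
    return alpha, (k * d - comb(k, 2)) * beta

def msr_point(k, d, beta=1):
    """MSR regime (4): d*beta = alpha + (k-1)*beta and B = k*alpha."""
    alpha = (d - k + 1) * beta
    return alpha, k * alpha

# Both regimes meet (1) with equality, i.e. they satisfy (2).
for k, d in [(2, 2), (2, 3), (3, 4), (4, 6)]:
    for point in (mbr_point, msr_point):
        alpha, B = point(k, d)
        assert cutset_bound(k, d, alpha, beta=1) == B
```

For instance, with k = 2, d = 2 and β = 1, the MBR point stores B = 3 message symbols with α = 2, while the MSR point stores B = 2 message symbols with α = 1.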

The two extremities of the tradeoff curve have been well-studied in the literature, and there exist several
explicit codes operating in these regimes and achieving (2) for exact repair. On the other hand, it
is shown in [12] that at essentially all other points on the tradeoff curve, there cannot exist any codes that
satisfy (2) for exact repair. Tighter outer bounds have been proposed for these intermediate points on the tradeoff
curve.

¹ In addition to achieving (2), the parameters must also satisfy the following minimality condition: a reduction in either α or β must
result in a violation of (1).

Having described the basic setup of regenerating codes (in the absence of security), we will now present a 
description of the setting where security from passive eavesdroppers and/or active adversaries is required. 


B. Security from Passive Eavesdroppers 


1) Threat Model and Upper Bounds: We consider a threat model that extends the model introduced in [33].
Under the model we consider, a passive eavesdropper may gain access to the data stored in a subset of the storage
nodes, and possibly also to the data downloaded during repair of some of these nodes. We assume that the 
eavesdropper may possess unbounded computational power and has complete knowledge of the system protocol. 
It is required that such an eavesdropper should not be able to obtain any knowledge (in the information-theoretic 
sense) about the message. 

An eavesdropper can gain read-access to the data stored in any arbitrary set of at most ℓ (< k) storage nodes.
In addition, the eavesdropper may also listen to the data downloaded during (any number of) repair operations of
some arbitrary subset of m (≤ ℓ) of these ℓ nodes. We call such an eavesdropper an {ℓ, m}-eavesdropper, and
term the security of data from such a passive eavesdropper as {ℓ, m}-security from the passive eavesdropper.
Note that since the data passed to a node for repair also contains the data it will finally store, the setting
described above can be equivalently stated in the following manner:

Definition 1 ({ℓ, m}-security from the passive eavesdropper): A code provides {ℓ, m}-security from the passive
eavesdropper if a passive eavesdropper gaining access to only the data stored in any (ℓ - m) nodes, and
the data passed for repair to any m nodes, obtains zero information about the message.

As an example of this model, consider a peer-to-peer storage system. The m nodes described above may 
represent nodes that are in a network belonging to the eavesdropper, thereby allowing the eavesdropper to listen 
to all the data downloaded when these m nodes undergo (possibly multiple) failures and repairs across time. On 
the other hand, the (ℓ - m) nodes may represent the nodes which may be exposed only momentarily, allowing
the eavesdropper access to only the data stored. 

We now describe an upper bound on the size of the message that can be stored for any code under this 
system model. In [33], Pawar et al. consider the setting of m = ℓ and provide an upper bound on the number of
message symbols B* that can be stored in the information-theoretically secure system. It can be shown easily
that even under our extended setting of having m ≤ ℓ, the proof of this upper bound (Theorem 1 of [33]) works
for any value of m (≤ ℓ). Using the bound from [33] for this setting, it follows that the parameters associated
with any regenerating code providing {ℓ, m}-security from passive eavesdroppers must necessarily satisfy


\[ B^{*} \le \sum_{i=\ell}^{k-1} \min\bigl(\alpha, (d-i)\beta\bigr). \tag{5} \]

Specializing to the MBR regime of α = dβ, this bound becomes

\[ B^{*} \le \sum_{i=\ell}^{k-1} (d-i)\beta. \tag{6} \]
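
As a worked check (not part of the original text), note that with α = dβ every term min(α, (d - i)β) in the bound (5) equals (d - i)β, so the MBR bound can be evaluated in closed form:

```latex
\begin{align*}
B^{*} \;\le\; \sum_{i=\ell}^{k-1} (d-i)\beta
  &= \left[ (k-\ell)\,d - \sum_{i=\ell}^{k-1} i \right]\beta \\
  &= \left[ (k-\ell)\,d - \binom{k}{2} + \binom{\ell}{2} \right]\beta .
\end{align*}
```

For example, with k = 2, d = 2, ℓ = 1 and β = 1 this evaluates to 2 - 1 + 0 = 1, i.e., a single secure message symbol.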


For the MSR regime, the bound in (5) turns out to be loose. Subsequent to the conference publication [1]
of our code constructions for security from passive eavesdroppers, the authors of [48] have provided tighter
upper bounds for the MSR regime:

\[ B^{*} \le \ldots \tag{7} \]


2) Summary of Our Results for Passive Eavesdroppers: This paper provides explicit constructions for: 


• MBR codes for all parameters [n, k, d], {ℓ, m} providing information-theoretic security from passive
eavesdroppers. These codes are optimal for all values of the parameters.

• MSR codes for all parameters [n, k, d ≥ 2k - 2], {ℓ, m} providing information-theoretic security from
passive eavesdroppers. These codes are optimal whenever m ≤ 1.


The MBR code is optimal by virtue of meeting the outer bound (6) (derived in [33]) with equality. The MSR
code is optimal for m ≤ 1 by virtue of meeting (7) (derived in [48]).


Our codes thus establish the secure capacity of distributed storage systems for the case of passive eavesdroppers
for all parameters at the MBR regime and for all [n, k, d ≥ 2k - 2] and m ≤ 1 at the MSR
regime.


We take the following approach for constructing the codes. To construct a secure code for a given [n, k, d],
we choose a product-matrix (PM) code with identical values of system parameters [n, k, d]. In the input to the
PM code (without secrecy constraints), we replace a carefully chosen set of

\[ R = B - B^{*} \tag{8} \]

message symbols with R random symbols, where B is as obtained by setting equality in (3) for the MBR case
and in (4) for the MSR case, and B* is the number of message symbols that the code can store under the specified security
requirement. Each of these random symbols is chosen uniformly and independently from F_q, and is also
independent of the message symbols. We then prove that this construction ensures that no {ℓ, m}-eavesdropper
can obtain any information about the message, thus ensuring {ℓ, m}-security from passive eavesdroppers.
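
For the MBR case, the number of random symbols in (8) can be computed directly. The sketch below assumes that B is given by the MBR expression (3) and that B* meets the MBR specialization of the bound (5) with equality, as stated for the MBR codes; the function names are hypothetical.

```python
from math import comb

def mbr_message_size(k, d, beta=1):
    """B at the MBR point without secrecy: (k*d - C(k,2)) * beta."""
    return (k * d - comb(k, 2)) * beta

def mbr_secure_message_size(k, d, ell, beta=1):
    """B* at the MBR point: sum over i = ell..k-1 of (d - i) * beta."""
    return sum((d - i) * beta for i in range(ell, k))

def num_random_symbols(k, d, ell, beta=1):
    """R = B - B*, the number of message symbols replaced by random symbols."""
    return mbr_message_size(k, d, beta) - mbr_secure_message_size(k, d, ell, beta)

# Parameters k = 2, d = 2, ell = 1: B = 3, B* = 1, so R = 2.
print(mbr_message_size(2, 2), mbr_secure_message_size(2, 2, 1), num_random_symbols(2, 2, 1))
```

Under these assumptions, setting ℓ = 0 gives R = 0, so the construction reduces to the underlying non-secure code when no eavesdropper is present.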

3) Example: We illustrate our code construction with a toy example of an MBR code with [n = 3, k =
2, d = 2] providing {ℓ = 1, m = 1}-security from passive eavesdroppers. The code is shown in Fig. 2.
The message {a} is encoded and stored across n = 3 nodes in a manner that it can be recovered from any
k = 2 of the nodes. The alphabet of operation is the finite field F_3. Symbols r_1 and r_2 are drawn uniformly at
random from F_3. A failed node is repaired by downloading one symbol each from any d = 2 nodes. A passive
eavesdropper gaining access to the data stored in any ℓ = 1 node and also to the data passed to that node
during any of its repair operations (m = 1) gets zero information about the message. For instance, the repair of
node 2 is shown in the figure, where one can see that the eavesdropper gains no information about a from the
data stored in or passed to any one node. The code is optimal, meaning that the amount of download required
for any repair is the minimum possible, and furthermore, the amount of storage required in this setting is also
minimum.
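
The secrecy claim in this example can be verified by brute force over F_3. The sketch below is an illustration under assumptions and is not the exact code of Fig. 2: it assumes a product-matrix-style layout in which node i stores ψ_i^T M for a symmetric 2 x 2 matrix M whose first row and column hold the random symbols, and it assumes the encoding vectors ψ_i = (1, x_i) with x_i ∈ {0, 1, 2}. All names are hypothetical.

```python
from collections import Counter
from itertools import product

Q = 3  # all arithmetic is modulo 3 (the field F_3)
# Assumed encoding vectors (1, x) with x = 0, 1, 2; not taken from Fig. 2.
PSI = [(1, 0), (1, 1), (1, 2)]

def message_matrix(a, r1, r2):
    # Symmetric 2x2 matrix: random symbols fill the first row and column.
    return [[r1, r2], [r2, a]]

def stored(node, M):
    # Content of a node: the row vector psi^T M.
    psi = PSI[node]
    return tuple(sum(psi[i] * M[i][j] for i in range(2)) % Q for j in range(2))

def repair_symbol(helper, failed, M):
    # A helper sends the inner product of its content with the failed node's vector.
    s = stored(helper, M)
    return sum(s[j] * PSI[failed][j] for j in range(2)) % Q

def eavesdropper_view(node, M):
    # Everything seen at one node: its content plus both repair downloads.
    helpers = [h for h in range(len(PSI)) if h != node]
    return stored(node, M) + tuple(repair_symbol(h, node, M) for h in helpers)

def leaks(node):
    # True if some observed view makes one value of `a` likelier than another.
    counts = Counter()
    for a, r1, r2 in product(range(Q), repeat=3):
        counts[(eavesdropper_view(node, message_matrix(a, r1, r2)), a)] += 1
    views = {view for view, _ in counts}
    return any(len({counts[(v, a)] for a in range(Q)}) > 1 for v in views)

print([leaks(node) for node in range(3)])
```

Each view is consistent with every value of a equally often, so the eavesdropper's posterior on a stays uniform. The choice of encoding vectors matters in this sketch: a vector such as (0, 1) would make a node store (r_2, a) and reveal the message.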


C. Security from Active Adversaries 

1) Threat Model and Upper Bounds: We consider the threat model wherein one or more nodes may be 
compromised to an active adversary. The adversary is assumed to be computationally unbounded and knows 
the protocol followed by the system. A node that is compromised to an active adversary may send arbitrarily 
corrupted data during any data-reconstruction or node-repair operation. It is required that the data-reconstruction 
operations complete successfully (without any errors) despite the presence of these active adversaries. This threat
model was introduced in [33]. Note that this threat model does not require node repairs to be performed without
errors, and only requires data-reconstruction operations to be performed without errors.

Fig. 2: MBR code with [n = 3, k = 2, d = 2] providing optimal {ℓ = 1, m = 1}-security from passive
eavesdroppers. Repair of node 2 is shown, and one can see that the eavesdropper gains no information about
a from the data stored in or passed to any single node.

The upper bound on the size of the message that can be securely stored under the above described threat 
model with at most p compromised nodes, as derived in [33], is

\[ B \le \sum_{i=2p}^{k-1} \min\bigl(\alpha, (d-i)\beta\bigr). \tag{9} \]

Our codes achieve this bound and also ensure that every repair operation is free of corruption.

Along the lines of traditional models of information-theoretic security, [33] takes a “static” approach to
secure-code construction, wherein the size of the message stored is reduced by an amount proportional to the 
magnitude “p” of the required security level. In particular, the value of p chosen is typically based on worst-case 
estimation, which leads to considerable wastage of system resources during the normal operations. Furthermore, 
in the event that the number of nodes compromised is greater than that anticipated, the security level cannot 
be increased on-the-fly and hence security can no longer be guaranteed.

Our approach to secure data from active adversaries is different from the conventional approach that we just 
described. Under our approach, the message is encoded and stored independent of any security requirements, 
and all the security requirements are handled “on-demand” at the decoding stage by downloading additional 
data. Let us assume that at some point in time we estimate a requirement of protection from the compromise of 
any p out of the n nodes in the network. Under our approach, this value of p is not fixed at the time of encoding 
of the data, but possible corruptions are corrected by downloading additional data during the reconstruction or 
repair processes. Now, during a particular instance of reconstruction or repair, the amount of additional data 
downloaded is a function of the desired protection level p , and this protection level can be chosen independently 
for each reconstruction and repair operation. This provides the advantage of not having to allocate resources for 
the worst case. This approach also enables the system to function with dynamic protection levels, and hence 
there is no need to estimate and fix the parameter p beforehand: the system administrator is free to choose the 
desired level of security at runtime. 

In our approach, the additional data downloaded for correcting possible corruptions is obtained by allowing a 
(greater) connectivity of Δ (≥ d) nodes during repair and κ (≥ k) nodes during reconstruction. The parameters
Δ and κ depend on the desired protection level p as

\[ \Delta = d + 2p, \tag{10a} \]
\[ \kappa = k + 2p. \tag{10b} \]

Under this notation the outer bound of [33] translates to

\[ B \le \sum_{i=2p}^{\kappa-1} \min\bigl(\alpha, (\Delta-i)\beta\bigr). \tag{11} \]

Observe that substituting the values of κ and Δ in terms of k and d from (10) in (11) results in (1). It follows
that achieving (11) with equality for all p is equivalent to achieving (10) and (1) with equality for all p. In
particular, it amounts to achieving (10) along with (3) for the MBR regime and (4) for the MSR regime.
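
The substitution mentioned above can be written out explicitly. The following worked step is a sketch in which Δ and κ denote the repair and reconstruction connectivities d + 2p and k + 2p, and the index shift j = i - 2p is the only manipulation used:

```latex
\begin{align*}
\sum_{i=2p}^{\kappa-1} \min\bigl(\alpha, (\Delta-i)\beta\bigr)
  &= \sum_{i=2p}^{k+2p-1} \min\bigl(\alpha, (d+2p-i)\beta\bigr) \\
  &= \sum_{j=0}^{k-1} \min\bigl(\alpha, (d-j)\beta\bigr), \qquad j = i - 2p .
\end{align*}
```

The right-hand side no longer depends on p, which is why a single stored code can be optimal for every protection level simultaneously.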

Our goal is to construct codes which can secure the data from active adversaries for any value of the desired 
level of security p that can be chosen separately at each instance of repair or reconstruction. We also require 
that, in addition, the codes be optimal simultaneously for all values of p. We term this on-demand security. 

Definition 2 (On-demand security): A code is said to provide on-demand security if for any arbitrary (feasible)
choice of p made at the time of a repair (or reconstruction) operation, the repair (or reconstruction)
algorithm can detect and correct all corruptions in the presence of up to p compromised nodes.

Note that by all feasible values of p, we mean all values of p for which the additional connectivity requirement
described in (10) can be satisfied (i.e., for all p such that (d + 2p) ≤ (n - 1)).

Remark 1 (On-demand detection of errors): Note that the definition of on-demand security also requires 
detection of corruption, when up to p of the nodes are compromised during any repair (or reconstruction) 
operation. The corruption detection can be performed by connecting to (d + p) (or (k + p)) nodes and downloading
β (or α) symbols from each of them during any repair (or reconstruction) operation as opposed to (d + 2p) (or
(k + 2p)) nodes for correction of corruption. The error detection properties of the codes presented in this paper
follow from their error correction properties in a straightforward manner using the basic relationship between
detection and correction of errors in block codes [49]. We hence focus only on the error correction in the rest
of the paper, and the error detection properties follow as implicit corollaries. 

In our codes providing on-demand security, the parameters [n, k, d] and {β, α, B} remain fixed, while the
values of Δ and κ may vary during different instances of reconstruction or repair operations depending on the
security level p desired at that time. Furthermore, our codes achieve the bound (11) with equality for all values
of p, and are hence optimal.
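
The run-time choice of p translates into a simple connectivity computation. The sketch below uses hypothetical function names; the feasibility test applied in detection-only mode (d + p ≤ n - 1) is an assumption of this sketch, since the text defines feasibility only through d + 2p ≤ n - 1.

```python
def feasible_levels(n, d):
    """All protection levels p >= 0 with d + 2p <= n - 1."""
    return list(range((n - 1 - d) // 2 + 1))

def connectivity(n, k, d, p, correct=True):
    """Number of nodes contacted for (repair, reconstruction) at level p.

    Correction needs d + 2p and k + 2p nodes; detection alone needs
    d + p and k + p nodes.
    """
    extra = 2 * p if correct else p
    if d + extra > n - 1:
        # Assumed check: a repair can use at most the n - 1 surviving nodes.
        raise ValueError("not enough surviving nodes for this protection level")
    return d + extra, k + extra

print(feasible_levels(5, 2))                 # [0, 1]
print(connectivity(5, 2, 2, 1))              # (4, 4)
print(connectivity(5, 2, 2, 1, correct=False))  # (3, 3)
```

Since the stored data does not depend on p, a caller can invoke such a routine with a different p for every repair or reconstruction.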

2) Summary of Our Results for Active Adversaries: This paper presents explicit constructions for: 

• MBR codes for all parameters [n, k, d] providing on-demand security from active adversaries. The codes
are optimal for all values of the parameters.

• MSR codes for all parameters [n, k, d ≥ 2k - 2] providing on-demand security from active adversaries.
The codes are optimal for all values of the parameters.

The optimality of the codes follows from the fact that the MBR codes meet (3) and (10) for all values of
the parameters and the MSR codes meet (4) and (10) for all values of the parameters covered. Our results
thus establish the secure capacity of distributed storage systems for the case of active adversaries for the 
aforementioned parameter regimes. 

Fig. 3: MBR code with [n = 5, k = 2, d = 2] providing optimal and on-demand security. Repair of node 1 is
shown when security from at most p = 1 compromised node is required. We can see that the repair operation
proceeds successfully in spite of the presence of a compromised node that sends arbitrarily corrupted data.


3) Example: We illustrate our code construction with an example of an MBR code with [n = 5, k = 2, d = 2] 
providing optimal and on-demand security. The code is shown in Fig. 3. The message {a, b, c} is encoded and 
stored across n = 5 nodes in a manner that it can be recovered from any k = 2 of the nodes. The alphabet 
of operation is the finite field $\mathbb{F}_5$. In the scenario that a repair operation needs to be made secure from the 
compromise of any p = 1 node, the replacement node connects to any 4 nodes and downloads one symbol from 
each as shown in the figure. Here, even in the presence of one arbitrary corruption (node 4 in the example), 
the desired symbols a and b are decoded correctly. When no security is required (p = 0), a failed node is 
repaired by connecting to any d = 2 nodes and downloading one symbol from each of them: the symbol passed 
by any node is identical to what it would have passed in the p = 1 case. This code is optimal and provides 
on-demand security: a repair operation may choose the desired level of protection p at runtime and the amount 
of network-bandwidth consumed in either case is the minimum possible, and furthermore, the amount of storage 
required to support this amount of download is also minimum. 


D. Other Applications 

While the codes constructed in this paper are presented from the perspective of providing security in a 
distributed storage system, they can also be employed for other relevant applications. 

1) Handling Packet Errors and Erasures in the Network: During a node-repair or a data-reconstruction 
operation, due to noise in the network, one or more packets downloaded may contain (arbitrary) errors. 
This is equivalent to a situation in the security scenario in which a set of nodes that are compromised to 
an active adversary may transmit corrupt packets. Our codes for on-demand security from active adversaries 
can alternatively be employed for handling packet-errors and erasures in the network. The reader is referred to 
a conference version © of this paper for more details on this application. 

The codes for security can correct for packet erasures as well. In this case, a node-repair (or data-reconstruction) 
operation guarding against p errors and p' erasures is carried out by downloading $\beta$ (or $\alpha$) symbols each from 
any arbitrary (d + 2p + p') (or (k + 2p + p')) nodes. An erasure of some p' of these packets leaves us with data from 
some arbitrary subset of (d + 2p) (or (k + 2p)) nodes, p of which could be in error. This resultant scenario is 
identical to the setting of Section II-C, which is addressed by our codes. The properties of on-demand security 
and optimality in the amount of download are also retained in the case of erasures. 
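
The node counts quoted above can be tabulated with a small helper. This is only an illustrative sketch: the function name `helpers_to_contact` and its arguments are hypothetical, and applying the detection count together with erasures is an assumption that extends the two cases stated in the text.

```python
def helpers_to_contact(base, errors=0, erasures=0, detect_only=False):
    """Nodes to contact for one repair (base = d) or one reconstruction (base = k).

    Correcting up to `errors` corrupted replies costs 2 * errors extra nodes,
    merely detecting them costs `errors` extra nodes, and every tolerated
    erasure costs one extra node. Combining detect_only with erasures is an
    assumption made for this sketch.
    """
    if min(base, errors, erasures) < 0:
        raise ValueError("arguments must be non-negative")
    per_error = 1 if detect_only else 2
    return base + per_error * errors + erasures


# Repair in the running example (d = 2) guarding against p = 1 error.
print(helpers_to_contact(2, errors=1))                     # correction: d + 2p
print(helpers_to_contact(2, errors=1, detect_only=True))   # detection: d + p
print(helpers_to_contact(2, errors=1, erasures=1))         # d + 2p + p'
```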

2) Reducing Latency of Degraded Reads in Data Centers: The property of on-demand security can also aid 
in reducing the latency of degraded reads in data centers. A degraded read (also called online repair) is 
an operation which is executed when a request comes in for data stored in a node that is busy or has failed. 
Under this operation, the request is met by downloading and recovering the requested data from the remaining 
nodes via a repair operation. Degraded reads are typically latency-sensitive, i.e., they must be served quickly. 


If the underlying code provides on-demand security, then it allows for the degraded read operation to be 
executed faster in the following manner. Let f denote the busy/failed node that stores the requested data. Now, 
one could use the repair property of the erasure code to recover the required data from any d other nodes. 
However, due to various sources of randomness in the system such as congestion at the nodes and delay in 
transmission, the replies from these d nodes will typically arrive at different times. As a result, the net latency 
will be the maximum of the response-times of these d nodes. On the other hand, a code providing on-demand 
security allows a reduction in latency by means of sending ‘redundant-requests’ [50]—[ as follows. For some 
choice of a parameter r (d ≤ r < n), connect to any r nodes and ask each of these nodes to aid in the ‘repair’ 
of node f. Since the code provides on-demand security, one can recover the data of node f (and hence serve 
the request) from the data obtained from the first d nodes that reply (the requests to the remaining nodes are 
then canceled). This lowers the latency of serving this request to the response-time of the dth fastest among 
the r (d ≤ r < n) helper nodes. 
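
The latency argument can be illustrated with a short simulation. The sketch below is not from the paper: it assumes, purely for illustration, independent exponentially distributed response times, and the names `degraded_read_latency` and `simulate` are hypothetical.

```python
import random


def degraded_read_latency(response_times, d):
    """Latency when the read completes once the d fastest helpers have replied."""
    if d > len(response_times):
        raise ValueError("need at least d helpers")
    return sorted(response_times)[d - 1]


def simulate(n_trials=10000, d=4, r=6, seed=0):
    """Average latency without and with redundant requests (assumed exponential delays)."""
    rng = random.Random(seed)
    plain = redundant = 0.0
    for _ in range(n_trials):
        times = [rng.expovariate(1.0) for _ in range(r)]
        plain += degraded_read_latency(times[:d], d)   # contact d nodes, wait for all of them
        redundant += degraded_read_latency(times, d)   # contact r nodes, wait for the d fastest
    return plain / n_trials, redundant / n_trials


plain_avg, redundant_avg = simulate()
print(f"d of d: {plain_avg:.3f}   d-th fastest of r: {redundant_avg:.3f}")
```

In every trial the d-th fastest of the r replies arrives no later than the slowest of any fixed d of them, so the second average can never exceed the first.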


E. Notational Conventions 

A vector will be treated as a column vector by default, and a row vector will be written as the transpose of 
the corresponding column vector. The transpose of a vector or a matrix will be denoted by a superscript T. 
The term ‘randomly drawn’ will mean ‘drawn uniformly at random’. 

III. Related Literature 


A. Codes for Distributed Storage 


The ‘regenerating codes’ model introduced in [32] considers optimizing two important resources: the storage 
capacity required per node, and the repair-bandwidth. It was shown in [32] that there exists a tradeoff between 
these two resources, and lower bounds on their requirements were derived. Subsequent to the work of [32], 
several explicit codes [9]–[12], [14], [16], [18], [55]–[58] were constructed for the MSR and the MBR regimes 
of regenerating codes, many of which meet these bounds. Furthermore, it was shown in (12), 04]], (59) that 
the bounds are loose at essentially all points in the interior of the tradeoff curve. 

A class of explicit codes that are optimal in terms of the storage and bandwidth requirements are the product- 
matrix codes proposed in [11]. The results of this paper are based on the product-matrix codes, and exploit 
certain unique features of the underlying product-matrix framework. A detailed description of the product-matrix 
codes is provided in Section IV. 


The requirement of security in the presence of repair dynamics was first considered in [33]. In [33], the 
authors provide lower bounds on the storage and bandwidth requirements under such a setting. Secure codes 
for the MBR regime with d = n − 1 are also provided, which are based on the repair-by-transfer code of [9], [12]. 
The codes of [33] providing security from active adversaries, also for the MBR regime with d = n − 1, allow 
propagation of the errors during the repair operation, and treat the resulting errors only at the reconstruction stage. 
Such a property may not be desirable in practice since the system administrators may not approve of allowing 
errors to linger around and propagate through the system. The secure codes presented in the present paper are 
applicable to any choice of the system parameters [n, k, d] in the MBR regime, and any $[n, k, d \geq 2k - 2]$ 
in the MSR regime. Furthermore, our codes ensure the correction of errors during every individual repair and 
the reconstruction operation. While this additional requirement clearly makes the system more practical, we 
show that surprisingly, this more stringent requirement does not necessitate any additional storage or bandwidth 
requirements. 


Subsequent to the initial presentation of our codes in [1], there have been several other works on information 
theoretic security in distributed storage systems with repair considerations ([48], [60]–[63]). In [48], the authors 
consider security from passive eavesdroppers and present outer bounds for the MSR setting that are tighter than 
those proposed in [33]. The authors also present a secure MSR code construction for d = n − 1 which employs 
Zigzag codes GD along with a maximum rank distance code. These codes are optimal for d = n — 1, m < 2. 
The secrecy capacity at the MSR regime for m > 2 remains open. In [61], the authors provide a tighter upper 
bound on the file size that can be stored in a system secure from passive eavesdroppers for codes employing 
linear encoding and decoding at the MSR point. The code construction provided in [48] meets the upper bound 
for linear codes provided in [61] for d = n − 1. 


Security from active adversaries is considered in [60] . A different adversarial model, where an active adversary 
can replace the content of an affected node only once, is considered and bounds and achievable schemes for 
this setting are provided. The paper also considers the MSR setting providing security from an active adversary 
who can replace the contents of affected nodes an unbounded number of times, and provides schemes that are 
optimal for a specific choice of the parameters. In [62], the authors provide information theoretic upper bounds 
for the cases n < 4 and k = d = n − 1. For these parameters, it is shown that MBR is the one and only 
regime for which the resource requirements in the system do not increase when, in addition to security from an 
eavesdropper being able to read data stored in nodes, security from an eavesdropper who can also tap on to the 
data downloaded during repair is desired. Security in distributed storage systems with heterogeneous storage 
nodes is considered in 


In [64], the authors deal with Byzantine fault tolerance by employing product-matrix codes [11]. They use a 
cyclic redundancy check (CRC) to check the integrity of data during repair and reconstruction, and a feedback 
scheme to iteratively correct them. However, CRC based schemes are not applicable in the present setting of 
information-theoretic security since the CRC may also be corrupted by the adversary. The present paper takes 
a more fundamental look at the problem of handling corruptions in regenerating codes from an information- 
theoretic perspective. 


In [65], the authors derive bounds to determine the secure capacity in a ‘cooperative-repair’ setting [66]. The 
bounds of [65] show that such an attempt to cooperatively repair may adversely affect the system in the presence 
of malicious adversaries. In [67], the authors study the security of distributed storage systems in the presence of 
a trusted verifier. In [68], the authors propose using polytope codes to address the issue of security against active 
adversaries. 


B. Shamir’s secret sharing 

A possible method to ensure information-theoretic security from passive eavesdroppers is to employ Shamir’s 
secret sharing scheme [69]. Under Shamir’s secret sharing scheme, the data is encoded and stored in a set of 
n nodes such that the entire data can be recovered from any k nodes, while access to data in any (k − 1) 
or fewer nodes provides zero information about the data. Now, during repair of a failed node, this scheme 
requires a download of the entire data to a central location, following which the data in the replacement node 
is re-encoded. Thus, as in the case of classical erasure codes, the repair operations are inefficient, mandating 
significant network resources. Furthermore, this central location represents a single point of failure, and security 
in such a system can be compromised by an eavesdropper who gains access to this location. These critical issues 
thus necessitate investigation of alternative solutions that account for the routinely performed failure-handling 
tasks. 
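
For readers unfamiliar with the scheme, the following is a minimal sketch of Shamir's secret sharing for a single secret symbol, using the standard polynomial construction. The prime `P`, the function names, and the choice of evaluation points 1, ..., n are assumptions made for this illustration and are not taken from the paper.

```python
import random

P = 257  # a prime larger than n; all arithmetic is modulo P (assumed for this sketch)


def make_shares(secret, k, n, rng):
    """Evaluate a random degree-(k-1) polynomial with constant term `secret` at x = 1..n."""
    coeffs = [secret] + [rng.randrange(P) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, j, P) for j, c in enumerate(coeffs)) % P)
            for x in range(1, n + 1)]


def recover(shares):
    """Lagrange interpolation at x = 0 from exactly k shares (needs Python 3.8+ for pow(.., -1, P))."""
    secret = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret


shares = make_shares(secret=123, k=3, n=5, rng=random.Random(1))
print(recover(shares[:3]), recover(shares[2:]))
```

Regenerating a lost share in this sketch means interpolating the whole polynomial from k shares at a single location, which is the repair inefficiency and the single point of exposure described above.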


C. Secure Network Coding 


The literature on secure network coding (e.g., [35], [39], [41], [42]) primarily considers a multicast setting in 
which there is a single source of data and every destination is interested in obtaining all the data transmitted by 
the source. Furthermore, with respect to security from passive eavesdroppers in the multicast setting, only the 
scenarios where the eavesdropper can access subsets of links are well understood in the literature. The problem 
of secure distributed storage considered in this paper requires handling the case when nodes are compromised. 
The problem of node-compromise is typically treated as a case of link-compromise by assuming that the 
eavesdropper gains access to all links that are incident upon the compromised nodes. In our problem, since 
a node may be repaired by connecting to any d nodes, schemes following such an approach cannot have a 
non-zero rate of transmission in general. 


Even in the case of active adversaries, the problem of handling compromise of nodes is much harder [40] 
than handling compromise of links. Furthermore, our problem falls into the harder setting of non-multicast, 
whereas the literature on secure network coding primarily considers the multicast setting. The works [37], [38] 
present results showing guarantees of error correction under random linear network coding under the condition 
that the subspace obtained at the receiver has a sufficient intersection with the transmitted subspace, a condition 
that is not guaranteed in our setting. In [36], the authors propose schemes to transmit a message equal to the 
difference between the largest message that can be sent in the absence of secrecy requirements and a bound 
on the number of compromised links. In our problem, this difference is almost always 0 or smaller since a 
compromised node may assist in any number of repairs of any other node. The results of the present paper, on 
the other hand, establish the capacity of a class of non-multicast networks in the presence of active adversaries 
or passive eavesdroppers having the ability to compromise nodes. Interestingly, these capacity-achieving codes 
that we propose in this paper are linear, deterministic and explicit. 


The problem of providing information-theoretic secrecy in distributed storage systems is also related to 
the Wiretap Channel II [70] where an eavesdropper, who is listening to any arbitrary subset of symbols (of 
fixed size) being transmitted over a noiseless point-to-point channel, should obtain no information about the 
original message. While schemes providing secrecy in a distributed storage system with only the reconstruction 
requirement would follow directly from [70], the requirement of addressing node-repair makes the problem 
non-trivial. 


IV. The Product-Matrix Framework 

The secure codes presented in this paper are based on the product-matrix codes [11], and this section describes 
their underlying framework. These codes operate at both extremities of the storage-bandwidth tradeoff for the 
parameters: 

1) MBR: all parameters [n, k, d], and 

2) MSR: all parameters $[n, k, d \geq 2k - 2]$. 

Fig. 4: An example illustrating the product-matrix framework. The encoding of an MBR code for parameters 
[n = 5, k = 2, d = 2], (B = 3, $\alpha$ = 2, $\beta$ = 1) and $\ell$ = m = 0 (i.e., no requirement of security from passive 
eavesdroppers) is depicted. The field of operation is $\mathbb{F}_5$. Also depicted is the repair of the first node with p = 0 
(i.e., no requirement of security from active adversaries). 

Observe from (3) and (4) that for each of these cases, the values of B and $\alpha$ are multiples of $\beta$. It follows 
that given an optimal code for $\beta = 1$, optimal codes for any higher value of $\beta$ (say, $\beta_0$) can be obtained by 
simply concatenating the $\beta = 1$ code $\beta_0$ number of times. This process is known as striping. Thus in this paper, 
without loss of generality, we consider only the case of $\beta = 1$. 
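
Striping can be sketched as follows. The helper name `stripe_encode` and the callback `encode_stripe` (a stand-in for any encoder that maps the B symbols of one stripe to the contents of n nodes) are hypothetical and introduced only for illustration.

```python
def stripe_encode(message, B, encode_stripe):
    """Split `message` into stripes of B symbols, encode each stripe independently,
    and let every node store the concatenation of its per-stripe contents."""
    if len(message) % B:
        raise ValueError("message length must be a multiple of B")
    stripes = [message[i:i + B] for i in range(0, len(message), B)]
    per_stripe = [encode_stripe(s) for s in stripes]   # each entry: list of n node contents
    n = len(per_stripe[0])
    return [[sym for enc in per_stripe for sym in enc[node]] for node in range(n)]


# Toy stand-in encoder for the demo only: two nodes that each copy the stripe.
toy_encoder = lambda stripe: [list(stripe), list(stripe)]
print(stripe_encode([1, 2, 3, 4], B=2, encode_stripe=toy_encoder))
```

With two stripes, each node stores twice as many symbols and passes twice as many during repair, while every operation remains a per-stripe operation of the single-stripe code.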

Product-matrix codes are represented in terms of an ($n \times \alpha$) code matrix 

\[ C = \Psi M , \tag{12} \]

where 

• the ‘code matrix’ $C$ is an ($n \times \alpha$) matrix with its $i$th row comprising the $\alpha$ symbols stored in node 
$i$ ($1 \leq i \leq n$), 

• the ‘message matrix’ $M$ is a ($d \times \alpha$) matrix, whose elements comprise all the message symbols, along 
with some random symbols, arranged in a specific (and possibly redundant) manner, and 

• the ‘encoding matrix’ $\Psi$ is a fixed ($n \times d$) matrix. Denoting the $i$th row of $\Psi$ by $\psi_i^T$, the $\alpha$ symbols stored 
in node $i$ are $\psi_i^T M$. The $d$-length vector $\psi_i$ is termed the encoding vector of node $i$. 

Thus each node stores the inner product of its encoding vector with the message matrix. The specific structures 
of the matrices $M$ and $\Psi$ vary with the choice of the operating regime (MBR or MSR) and the level of 
security required, and their design is described in the subsequent sections. Fig. 4 depicts an example of a code 
constructed under the product-matrix framework. 
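
As a concrete illustration of the encoding $C = \Psi M$, the sketch below encodes three symbols over $\mathbb{F}_5$ for [n = 5, k = 2, d = 2]. The particular matrix `PSI` is an assumption chosen to be consistent with the repair data quoted in Example 1, and the numeric message values are arbitrary.

```python
Q = 5  # prime field size; arithmetic is modulo Q

# Assumed (n x d) encoding matrix; any two rows are linearly independent over F_5.
PSI = [[1, 0], [0, 1], [1, 1], [1, 2], [1, 3]]


def encode(psi, m, q=Q):
    """Return the code matrix C = psi * m (mod q); row i holds the symbols of node i + 1."""
    return [[sum(row[t] * m[t][c] for t in range(len(m))) % q
             for c in range(len(m[0]))]
            for row in psi]


a, b, c = 2, 3, 4            # arbitrary message symbols in F_5
M = [[a, b], [b, c]]         # symmetric (d x d) message matrix
for node, symbols in enumerate(encode(PSI, M), start=1):
    print(f"node {node}: {symbols}")
```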

Systematic Codes: A systematic code is one wherein some set of k nodes together store all the B message 
symbols in an uncoded form. These k nodes are called systematic nodes. In several applications of interest, 
it may be desired to have the code in a systematic form, since decoding of the data is not required whenever 
the data-collector connects to the k systematic nodes. 

The codes for security from active adversaries presented in this paper can be constructed in either 
systematic or non-systematic form. On the other hand, a code operating under the requirement of information- 
theoretic security from passive eavesdroppers cannot be systematic. This is because, if any part of the message 
is stored in an uncoded form in any node, an eavesdropper tapping onto that node would obtain a non-zero 
amount of information about the message. 

The following sections employ the product-matrix framework to construct secure MBR and MSR codes. 


V. MBR codes Secure from Active Adversaries 


MBR codes achieve minimum possible download during repair: a replacement node downloads only what it 
stores, resulting in $d\beta = \alpha$. Recall from Section II-A (see (3)) that in the absence of security requirements, an 
MBR code must satisfy 

\[ B = \left( kd - \frac{k(k-1)}{2} \right) \beta , \qquad d\beta = \alpha . \tag{13} \]


In this section we present explicit constructions of optimal MBR codes for all parameter values [n, k, d], 
providing security from active adversaries. As discussed in Section II-C, our code constructions, in addition to 
being optimal, provide on-demand security. These codes are optimal in the sense that they achieve GD with 
equality for every value of the desired security level p. 


A. Encoding algorithm 


The encoding procedure of the codes for security from active adversaries is identical to that of the product- 
matrix codes without security requirements [11], and is independent of the desired protection levels. This 
procedure is briefly described below for the sake of completeness. 


We apply the striping procedure described in Section IV and construct codes for $\beta = 1$. Identical encoding, 
repair and reconstruction operations are performed independently on each stripe of $\beta = 1$. Setting $\beta = 1$ in (13) 
gives 

\[ B = kd - \frac{k(k-1)}{2} , \qquad d = \alpha , \qquad \beta = 1 . \tag{14} \]


Recall from Section IV that the product-matrix codes are represented in terms of an ($n \times \alpha$) code matrix 

\[ C = \Psi M \tag{15} \]

where the $i$th row of $C$ contains the $\alpha$ symbols stored in node $i$ ($1 \leq i \leq n$). Thus the $i$th node stores $\psi_i^T M$, 
where $\psi_i^T$ is the $i$th row of $\Psi$, and $\psi_i$ is termed the encoding vector of node $i$. 

The encoding matrix $\Psi$ for the MBR code is of the form 

\[ \Psi = \begin{bmatrix} \Phi & E \end{bmatrix} , \tag{16} \]

where $\Phi$ is an ($n \times k$) matrix, and $E$ is an ($n \times (d-k)$) matrix. The matrices $\Phi$ and $E$ are chosen such 
that: (i) any $k$ rows of $\Phi$ are linearly independent, and (ii) any $d$ rows of $\Psi$ are linearly independent. These 
requirements can be met, for example, by choosing $\Psi$ to be either a Cauchy or a Vandermonde matrix [71]. 
Observe that these properties make $\Phi$ the generator matrix of an $[n, k]$ MDS code, and $\Psi$ the generator matrix 
of an $[n, d]$ MDS code. The choice of the finite field is restricted (only) by the matrix $\Psi$; choosing $\Psi$ as a 
Vandermonde matrix permits any $q \geq n$. 
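
Conditions (i) and (ii) can be checked by brute force for small parameters. The sketch below builds a Vandermonde matrix over a prime field and verifies both conditions; the function names are hypothetical, and restricting to prime q (so that arithmetic is plain modular arithmetic) is a simplifying assumption of this illustration.

```python
from itertools import combinations


def vandermonde(n, d, q):
    """Rows [1, x, x^2, ..., x^(d-1)] mod q for the n distinct points x = 0..n-1 (needs n <= q)."""
    return [[pow(x, j, q) for j in range(d)] for x in range(n)]


def rank_mod_q(rows, q):
    """Rank over the prime field F_q by Gaussian elimination (Python 3.8+ for pow(.., -1, q))."""
    rows = [r[:] for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col] % q), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, q)
        rows[rank] = [(v * inv) % q for v in rows[rank]]
        for i in range(len(rows)):
            if i != rank and rows[i][col] % q:
                f = rows[i][col]
                rows[i] = [(v - f * w) % q for v, w in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def satisfies_mbr_conditions(psi, k, d, q):
    """(i) any k rows of the first k columns and (ii) any d rows of psi are independent."""
    idx = range(len(psi))
    cond_i = all(rank_mod_q([psi[i][:k] for i in s], q) == k for s in combinations(idx, k))
    cond_ii = all(rank_mod_q([psi[i] for i in s], q) == d for s in combinations(idx, d))
    return cond_i and cond_ii


print(satisfies_mbr_conditions(vandermonde(5, 3, 5), k=2, d=3, q=5))
```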

The message matrix $M$ contains the $B$ message symbols arranged in a specific form. As mentioned previously, 
in our approach, the encoding mechanism is independent of the number of compromised nodes in the system. 
Hence the number of message symbols is equal to that in the regenerating codes setting with no security 
requirements (14): 

\[ B = kd - \frac{k(k-1)}{2} = k(d-k) + \frac{k(k+1)}{2} . \tag{17} \]

The ($d \times d$) message matrix is designed to be of the form 

\[ M = \begin{bmatrix} S & V \\ V^T & 0 \end{bmatrix} , \tag{18} \]

where $S$ is a ($k \times k$) symmetric matrix, and $V$ is a ($k \times (d-k)$) matrix. The $B$ message symbols populate 
the $\frac{k(k+1)}{2}$ distinct entries of $S$ and the $k(d-k)$ entries of $V$. Observe that this specific structure makes the 
message matrix $M$ symmetric. 
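
The structure of $M$ can be sketched in code as follows. The order in which the message symbols are placed (upper triangle of $S$ row by row, then $V$ row by row) is an assumption made for this illustration; the text only requires that the symbols occupy the distinct entries of $S$ and the entries of $V$.

```python
def mbr_message_matrix(symbols, k, d):
    """Arrange B = k(k+1)/2 + k(d-k) symbols into the symmetric (d x d) matrix [[S, V], [V^T, 0]]."""
    expected = k * (k + 1) // 2 + k * (d - k)
    if len(symbols) != expected:
        raise ValueError(f"expected {expected} symbols, got {len(symbols)}")
    it = iter(symbols)
    M = [[0] * d for _ in range(d)]
    for i in range(k):               # distinct entries of the symmetric block S
        for j in range(i, k):
            M[i][j] = M[j][i] = next(it)
    for i in range(k):               # entries of V, mirrored into V^T
        for j in range(k, d):
            M[i][j] = M[j][i] = next(it)
    return M


print(mbr_message_matrix([1, 2, 3], k=2, d=2))
print(mbr_message_matrix([1, 2, 3, 4, 5], k=2, d=3))
```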

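In the MBR toy example provided in Fig. 4 with [n = 5, k = 2, d = 2] and ($\beta = 1$, $\alpha = 2$, $B = 3$), the 
($5 \times 2$) encoding matrix $\Psi$ and the ($2 \times 2$) message matrix $M$ take the values: 

\[ \Psi = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} , \qquad M = \begin{bmatrix} a & b \\ b & c \end{bmatrix} , \]

with $a$, $b$, and $c$ as the three message symbols. 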

Remark 2 (Systematic code): The PM-MBR code can be made systematic by choosing the encoding matrix 
$\Psi$ as 

\[ \Psi = \begin{bmatrix} I_k & 0 \\ \tilde{\Phi} & \tilde{E} \end{bmatrix} , \tag{19} \]

where $I_k$ is the ($k \times k$) identity matrix, $0$ is a ($k \times (d-k)$) zero matrix, and $\tilde{\Phi}$ and $\tilde{E}$ are matrices of sizes 
($(n-k) \times k$) and ($(n-k) \times (d-k)$) respectively, such that $[\tilde{\Phi} \ \tilde{E}]$ is a Cauchy matrix (or any matrix such 
that all of its submatrices are of full rank). The example in Fig. 4 depicts a systematic product-matrix MBR 
code. 

This completes the description of the encoding procedure. We now present the decoding algorithms that 
ensure security from active adversaries. 


B. Algorithm for Secure Node Repair 

The following theorem presents an explicit algorithm for node-repair in the presence of compromised nodes. 

Theorem 1 (Secure node-repair): In the code presented, for any value of parameter p, the repair of a failed 
node can be secured from the compromise of up to p nodes, by letting the replacement node download $\beta = 1$ 
symbol each from (d + 2p) arbitrary nodes. The amount of data downloaded in this process is minimum, thus 
establishing the secure capacity of such systems. The choice of parameter p is arbitrary, and can be different 
for every instance of repair, thus providing on-demand security. 

Proof: Suppose during any instance of node-repair, it is desired to provide security from the compromise 
of up to p nodes, for some choice of p. Let $f$ denote the failed node and $\psi_f$ be its encoding vector. The $\alpha$ (= d) 
symbols stored in the failed node are 

\[ \psi_f^T M . \]

The replacement node then connects to some arbitrary (d + 2p) nodes. The replacement node is required to 
recover these d symbols by downloading $\beta = 1$ symbol each from the (d + 2p) nodes that it connects to, when 
at most p of these nodes are compromised. Let $J$ denote the set of these (d + 2p) nodes. Under our repair 
algorithm, each of these (d + 2p) nodes passes the inner-product of the $\alpha$ values stored in it with the encoding 
vector of the failed node $f$. That is, for every $j \in J$, node $j$ passes the value of 

\[ \psi_j^T M \psi_f . \]

This value may be corrupt if node $j$ is compromised to the adversary. Let $\Psi_J$ denote the ($(d+2p) \times d$) 
submatrix of $\Psi$ with its (d + 2p) rows comprising the vectors $\{\psi_j^T\}_{j \in J}$. Under this notation, one can see that 
the replacement node has access to the (d + 2p) encoded symbols 

\[ \Psi_J M \psi_f , \]

of which at most p symbols may be in error. This set of symbols can equivalently be viewed as an encoding 
of $M\psi_f$ by the generator matrix $\Psi_J$. To complete the decoding process, we now call upon the MDS property 
of the matrix $\Psi$. By construction, $\Psi$ is the generator matrix of an $[n, d]$ MDS code. Hence its submatrix $\Psi_J$ is 
the generator matrix of a $[d+2p, d]$ MDS code. This implies that the code with $\Psi_J$ as its generator matrix 
has a minimum distance of (2p + 1). Thus, when at most p of the (d + 2p) symbols $\Psi_J M \psi_f$ are corrupt, the 
replacement node can apply standard decoding algorithms for MDS codes and recover $M\psi_f$ correctly. Finally, 
since the message matrix $M$ is symmetric, the d elements comprising $M\psi_f$ can be written as 

\[ (M\psi_f)^T = \psi_f^T M , \]

which are precisely the d values that were stored in the failed node. 

The repair algorithm described above required the replacement node to connect to some arbitrary A = (d + 2p) 
nodes, and download $\beta = 1$ symbol from each. The amount of storage and repair-bandwidth consumed meet (17) 
and (5) with equality, thus establishing optimality. ■ 

Example 1: Fig. 3 depicts an example of this algorithm where repair of the first node is performed, while 
providing protection from up to p = 1 compromised nodes. The encoding process of the code is illustrated in 
Fig. 4. For the repair of node 1 as shown in the figure, the replacement node connects to A = d + 2p = 4 
other nodes and each of these nodes passes an inner-product of its data with $\psi_1^T = [1 \ 0]$, i.e., passes its first 
symbol. The data {b, a + b, a + 2b, a + 3b} is an MDS encoding of {a, b} allowing for the correction of up 
to one arbitrary error. 
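
The repair procedure of Theorem 1 can be simulated for this example. The sketch below is an illustration under stated assumptions: the matrix `PSI` is chosen to reproduce the data {b, a + b, a + 2b, a + 3b} quoted above, the message values are arbitrary, and the exhaustive search over candidates is a stand-in for the "standard decoding algorithms for MDS codes", used here only because the field and dimensions are tiny.

```python
from itertools import product

Q = 5
PSI = [[1, 0], [0, 1], [1, 1], [1, 2], [1, 3]]  # assumed encoding matrix (rows = nodes 1..5)


def dot(u, v, q=Q):
    return sum(x * y for x, y in zip(u, v)) % q


def helper_symbol(j, M, failed):
    """Symbol passed by helper j (0-based): (psi_j^T M) psi_f."""
    stored = [dot(PSI[j], col) for col in zip(*M)]   # contents of node j
    return dot(stored, PSI[failed])


def repair(received, p):
    """Recover M psi_f from {helper index: symbol} with at most p corrupted symbols.

    Brute-force nearest-codeword search over F_Q^d; unique when d + 2p helpers reply.
    """
    d = len(PSI[0])
    for cand in product(range(Q), repeat=d):
        mismatches = sum(dot(PSI[j], cand) != s for j, s in received.items())
        if mismatches <= p:
            return list(cand)        # equals psi_f^T M because M is symmetric
    raise ValueError("more than p symbols appear to be corrupted")


a, b, c = 2, 3, 4
M = [[a, b], [b, c]]
received = {j: helper_symbol(j, M, failed=0) for j in (1, 2, 3, 4)}
received[3] = (received[3] + 1) % Q      # node 4 is compromised and sends a wrong symbol
print(repair(received, p=1))             # contents [a, b] of node 1
```

With p = 0 the same routine run on the replies of any d = 2 helpers returns the same result, and each helper sends exactly the symbol it would send in the p = 1 case.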


C. Algorithm for Secure Data Reconstruction 

The following theorem presents an explicit algorithm for data reconstruction in the presence of compromised 
nodes. 

Theorem 2 (Secure data-reconstruction): In the code presented, a data-collector can recover all the B message 
symbols by downloading data stored in (k + 2p) arbitrary nodes, in the presence of up to p compromised 
nodes. The choice of parameter p is arbitrary, and can be different for every instance of data-reconstruction, 
thus providing on-demand security. 

Proof: Suppose during any instance of data-reconstruction, it is desired to provide protection from the 
compromise of up to p nodes, for some choice of p. The data-collector then connects to (k + 2p) arbitrary 
nodes. Let $X$ denote the set of (k + 2p) nodes to which the data-collector connects. Then, from every node 
$i \in X$, the data collector downloads the $\alpha$ symbols 

\[ \psi_i^T M , \]

which could be corrupt if node $i$ is compromised. Let $\Psi_X$ denote the ($(k+2p) \times d$) submatrix of $\Psi$, with its 
(k + 2p) rows comprising $\{\psi_i^T\}_{i \in X}$. From the structure of $\Psi$ (see (16)) we have 

\[ \Psi_X = \begin{bmatrix} \Phi_X & E_X \end{bmatrix} , \tag{20} \]

where $\Phi_X$ is a ($(k+2p) \times k$) matrix and $E_X$ is a ($(k+2p) \times (d-k)$) matrix. The (k + 2p) rows of $\Phi_X$ and $E_X$ 
respectively are comprised of (k + 2p) of the rows of $\Phi$ and $E$ (indexed by $X$). From the specific structure (18) 
of the message matrix $M$, we see that the data-collector equivalently has access to the encoded symbols 

\[ \Psi_X M = \begin{bmatrix} \Phi_X S + E_X V^T & \Phi_X V \end{bmatrix} . \]

Note that the (k + 2p) rows of the matrix $\Psi_X M$ are obtained from (k + 2p) different helper nodes. Thus, a 
compromise of up to p of the helper nodes to an active adversary leads to at most p of the rows of $\Psi_X M$ being 
corrupt. 

We now call upon the MDS property of $\Phi$ to complete the reconstruction process. Recall that by construction, 
$\Phi$ is the generator matrix of an $[n, k]$ MDS code. Hence its submatrix $\Phi_X$ is the generator matrix of a $[k+2p, k]$ 
MDS code. It follows that a code with $\Phi_X$ as its generator matrix has a minimum distance of (2p + 1), and thus 
has the ability to correct up to p arbitrary errors. Consider the set of symbols $\Phi_X V$ that the data collector has 
obtained. Observe that any column of the matrix $\Phi_X V$ is an encoding of the corresponding column of $V$ with $\Phi_X$ 
as the generator matrix. Thus, when at most p of the nodes are compromised, the data-collector can correctly 
decode each column of $\Phi_X V$ separately using standard algorithms for decoding MDS codes. At the end of this 
process, the data-collector recovers the matrix $V$ correctly. The next step is to decode $S$ from the downloaded 
data. Since the value of $V$ is (correctly) known, the data-collector can subtract $E_X V^T$ from ($\Phi_X S + E_X V^T$) to 
recover $\Phi_X S$. Again, the matrix $\Phi_X S$ is an encoding of the columns of $S$ by the generator matrix $\Phi_X$, which 
allows for the decoding of $S$ in a manner identical to the decoding of $V$ even in the presence of up to p corrupt 
rows. 

In this setting, one can verify that the amounts of storage and repair-bandwidth consumed meet (9) with 
equality, thus establishing optimality. ■ 
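
The two-stage decoding in the proof of Theorem 2 (first $V$, then $S$) can be sketched as follows. This is an illustration under assumptions: a prime field $\mathbb{F}_5$, an exhaustive column decoder standing in for a standard MDS decoder, and hypothetical function names; the test matrix used in the demo is a Vandermonde matrix for [n = 5, k = 2, d = 3].

```python
from itertools import product

Q = 5


def decode_column(G, y, p, q=Q):
    """Find the unique x with at most p mismatches between G x and y (brute force)."""
    for cand in product(range(q), repeat=len(G[0])):
        enc = [sum(g * c for g, c in zip(row, cand)) % q for row in G]
        if sum(e != s for e, s in zip(enc, y)) <= p:
            return list(cand)
    raise ValueError("more than p rows appear to be corrupted")


def reconstruct(psi_x, Y, k, p, q=Q):
    """Recover (S, V) from Y = psi_x M, where up to p rows of Y may be corrupted."""
    d, m = len(psi_x[0]), len(Y)
    phi_x = [row[:k] for row in psi_x]
    e_x = [row[k:] for row in psi_x]
    # Stage 1: the last d-k columns of Y equal Phi_X V; decode each column of V.
    v_cols = [decode_column(phi_x, [Y[i][c] for i in range(m)], p) for c in range(k, d)]
    # Stage 2: subtract E_X V^T from the first k columns, then decode the columns of S.
    corr = [[sum(e_x[i][t] * v_cols[t][c] for t in range(d - k)) % q for c in range(k)]
            for i in range(m)]
    s_cols = [decode_column(phi_x, [(Y[i][c] - corr[i][c]) % q for i in range(m)], p)
              for c in range(k)]
    S = [[s_cols[c][i] for c in range(k)] for i in range(k)]
    V = [[v_cols[t][i] for t in range(d - k)] for i in range(k)]
    return S, V


psi = [[pow(x, j, Q) for j in range(3)] for x in range(5)]      # Vandermonde, n = 5, d = 3
M = [[1, 2, 4], [2, 3, 0], [4, 0, 0]]                           # S = [[1,2],[2,3]], V = [[4],[0]]
X = [0, 1, 2, 3]                                                # k + 2p = 4 nodes, p = 1
Y = [[sum(psi[i][t] * M[t][c] for t in range(3)) % Q for c in range(3)] for i in X]
Y[2] = [0, 0, 0]                                                # one compromised node
print(reconstruct([psi[i] for i in X], Y, k=2, p=1))
```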


VI. MBR codes for Security from Passive Eavesdroppers 


In this section, we present explicit constructions of MBR codes secure from passive eavesdroppers that 
support all values of [n, k, d] and all $\{\ell, m\}$. As discussed previously, unlike our codes for security from 
active adversaries, the codes for security from passive eavesdroppers do not provide on-demand security. As a 
result, the desired level $\{\ell, m\}$ of security must be chosen a priori at the time of encoding. The MBR regime 
is governed by the relation $\alpha = d\beta$, and substituting this in (5) gives an upper bound on the amount of data 
$B^*$ that can be stored in an $\{\ell, m\}$-secure MBR code as 

\[ B^* \leq \left[ \left( kd - \frac{k(k-1)}{2} \right) - \left( \ell d - \frac{\ell(\ell-1)}{2} \right) \right] \beta . \tag{21} \]

Fig. 5: Structure of the ($d \times d$) message matrix $M^*$ for the $\{\ell, m\}$-secure MBR code. The matrix is symmetric 
and comprises blocks of data, random symbols, and zeros. The code is given by $C^* = \Psi^* M^*$ where $\Psi^*$ is 
the encoding matrix. 


The codes presented here achieve this bound with equality 

\[ B^* = \left[ \left( kd - \frac{k(k-1)}{2} \right) - \left( \ell d - \frac{\ell(\ell-1)}{2} \right) \right] \beta , \tag{22} \]

thus establishing this as the secure capacity under the MBR regime for all values of the parameters [n, k, d] 
and $\{\ell, m\}$. 
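
As a quick numerical check of this file-size expression, the sketch below evaluates it for given parameters. The function names are hypothetical and introduced only for illustration.

```python
def mbr_file_size(k, d, beta=1):
    """B = (kd - k(k-1)/2) * beta: message size of an MBR code without security."""
    return (k * d - k * (k - 1) // 2) * beta


def secure_mbr_file_size(k, d, ell, beta=1):
    """B* = [(kd - k(k-1)/2) - (ell*d - ell(ell-1)/2)] * beta for 0 <= ell <= k <= d."""
    if not 0 <= ell <= k <= d:
        raise ValueError("require 0 <= ell <= k <= d")
    return mbr_file_size(k, d, beta) - mbr_file_size(ell, d, beta)


# Parameters k = 2, d = 2 with one eavesdropped node.
print(mbr_file_size(2, 2), secure_mbr_file_size(2, 2, ell=1))
```

The difference between the two values is the number of random symbols that take the place of message symbols in the secure code.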

When any parameter takes different values in the secure and non-secure versions, we will denote the parameter 
associated with a code providing security with a superscript asterisk (*) as done in (21) and (22). 


A. Encoding 

We will now construct an $\{\ell, m\}$-secure MBR code satisfying (22). We apply the striping procedure described 
in Section IV and (without loss of generality) construct codes for the case $\beta = 1$. Setting $\beta = 1$ in (22) gives 

\[ B^* = \left( kd - \frac{k(k-1)}{2} \right) - \left( \ell d - \frac{\ell(\ell-1)}{2} \right) , \qquad \alpha = d , \qquad \beta = 1 . \]

The product-matrix codes for security from eavesdroppers possess a structure identical to those without 
security requirements. The difference between the two codes is that in the absence of security requirements, 
the elements of the message matrix $M$ consist of the message symbols, whereas in the situation when security 
is needed, a specific set of elements of the message matrix are replaced by random symbols. Thus, to construct 
an $\{\ell, m\}$-secure code for any [n, k, d], first consider an [n, k, d] product-matrix MBR code in the absence 
of security requirements, as constructed in Section V-A. Denote this code by $\mathcal{C}$, and further, let $\mathcal{C}^*$ denote 
the $\{\ell, m\}$-secure code (which we will construct below). The code $\mathcal{C}$ is described by an ($n \times \alpha$) code matrix 
$C = \Psi M$ as defined in (15), with the ($d \times d$) matrix $M$ comprising $B = kd - \frac{k(k-1)}{2}$ message values. 

Following the product-matrix framework, the code $\mathcal{C}^*$ is also described by an ($n \times \alpha$) code matrix $C^*$ of the 
form $C^* = \Psi^* M^*$. The matrices $\Psi^*$ and $M^*$ for code $\mathcal{C}^*$ are obtained by modifying the matrices $\Psi$ and $M$ of 
code $\mathcal{C}$ in a manner described below. 


The ($n \times d$) encoding matrix $\Psi^*$ is required to satisfy the following property in addition to those required by 
$\Psi$: when restricted to the first $\ell$ columns, any $\ell$ rows are linearly independent. The choice of $\Psi^*$ as a Cauchy 
or a Vandermonde matrix satisfies this property as well. 

The ($d \times d$) message matrix $M^*$ is obtained by replacing the 

\[ R = B - B^* = \ell d - \frac{\ell(\ell-1)}{2} \tag{23} \]

message symbols in the first $\ell$ rows (and hence first $\ell$ columns) of the symmetric matrix $M$ by $R$ random 
symbols. Each random symbol is drawn independently and uniformly from $\mathbb{F}_q$. The structure of the resulting 
message matrix is illustrated in Fig. 5. 

Finally, node $i$ ($1 \leq i \leq n$) stores the $i$th row of the ($n \times \alpha$) code matrix $C^* = \Psi^* M^*$. 


The following example illustrates the encoding procedure. 


Example 2: Consider the code depicted in Fig. 2 providing security from passive eavesdroppers. This code
operates in the MBR regime with parameters $n = 3$, $k = 2$ and $d = 2$, and provides security against
eavesdroppers who may gain access to $\{\ell = 1, m = 1\}$ nodes. Let the alphabet of operation be $\mathbb{F}_3$ and
let $\alpha = 2$ and $\beta = 1$. From (22) we get that the maximum size of the message that can be stored securely in
this system is $B^* = 1$ symbol (we could have stored $B = 3$ symbols in the absence of security requirements).
Let us denote this solitary message symbol as $a$. Following (23), we draw $R = 2$ symbols $r_1$ and $r_2$ uniformly
at random from $\mathbb{F}_3$. The encoding vector of any node $i$ ($1 \le i \le 3$) under this code is $\psi_i^T = [1 \;\; i]$. The message
matrix is

$$M^* = \begin{bmatrix} r_1 & r_2 \\ r_2 & a \end{bmatrix} .$$

Each node $i$ stores $\psi_i^T M^*$ as shown in Fig. 2.
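
For a code this small, the statement that a single node reveals nothing about the message can be checked exhaustively. The following is a minimal sketch, not part of the original construction: the function names are hypothetical, and it examines only the data stored in one node (it does not model repair traffic). It enumerates every choice of $(r_1, r_2)$ and confirms that the distribution of each node's contents is the same for all values of $a$.

```python
from itertools import product

Q = 3  # field size: F_3, as in Example 2

def node_data(i, a, r1, r2):
    """Row i of Psi* M*, with psi_i = [1, i] and M* = [[r1, r2], [r2, a]] over F_3."""
    return ((r1 + i * r2) % Q, (r2 + i * a) % Q)

def view_distribution(i, a):
    """Sorted list of what node i stores as (r1, r2) ranges over all of F_3^2."""
    return sorted(node_data(i, a, r1, r2) for r1, r2 in product(range(Q), repeat=2))

# For every node, the distribution of stored data is identical for a = 0, 1, 2,
# so observing one node gives no information about the message symbol a.
for i in (1, 2, 3):
    views = [view_distribution(i, a) for a in range(Q)]
    assert views[0] == views[1] == views[2]
```

Each distribution is in fact uniform over the nine pairs in $\mathbb{F}_3^2$, since for fixed $a$ the map from $(r_1, r_2)$ to the stored pair is a bijection.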


B. Reconstruction, Repair and Security 


The following theorems prove the properties of reconstruction, repair and security. 

Theorem 3 (Data-reconstruction and node-repair): In code $\mathcal{C}^*$, a data-collector can recover all the $B^*$ message
symbols by downloading the data stored in any $k$ nodes, and a failed node can be repaired by downloading
$\beta = 1$ symbol each from any $d$ remaining nodes.

Proof: Treating the random symbols also as message symbols, code $\mathcal{C}^*$ becomes identical to $\mathcal{C}$. Thus
reconstruction and repair in $\mathcal{C}^*$ are identical to those in $\mathcal{C}$. ■


Note that repair and reconstruction can be carried out while ensuring security from corruption by active
adversaries by employing the explicit algorithms of Section V-B and Section V-C.


Theorem 4 (Information-theoretic security): The code $\mathcal{C}^*$ ensures $\{\ell, m\}$-security, i.e., an eavesdropper having
access to the data stored in up to $\ell$ nodes, and to all the data passed to up to $m$ of these nodes during one
or more of their repair operations, gets no information about the message.


Proof: The complete proof is provided in the appendix, and we provide an outline here. Let $\mathcal{U}$, $\mathcal{R}$ and
$\mathcal{E}$ denote random variables corresponding to the message, the random symbols inserted at the
encoding stage, and the data available to the eavesdropper respectively. The proof proceeds in three steps: the
first step shows that $H(\mathcal{R} \mid \mathcal{E}, \mathcal{U}) = 0$. The second step proves that $H(\mathcal{E}) \le R$. The final step shows that these two
conditions suffice to guarantee complete security, i.e., $I(\mathcal{U}; \mathcal{E}) = 0$. ■
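
The final step of this outline can be sketched as the following chain of equalities. This is an illustrative derivation rather than the appendix proof itself. It assumes that entropy is measured in units of $\log q$ (so that $R$ uniform symbols have entropy $R$), that the random symbols are independent of the message (so $H(\mathcal{R} \mid \mathcal{U}) = R$), and that the eavesdropper's data is a deterministic function of the message and the random symbols.

```latex
\begin{align*}
I(\mathcal{U};\mathcal{E})
  &= H(\mathcal{E}) - H(\mathcal{E}\mid\mathcal{U}) \\
  &= H(\mathcal{E}) - H(\mathcal{E}\mid\mathcal{U}) + H(\mathcal{E}\mid\mathcal{U},\mathcal{R})
     && \text{since } H(\mathcal{E}\mid\mathcal{U},\mathcal{R}) = 0 \\
  &= H(\mathcal{E}) - I(\mathcal{E};\mathcal{R}\mid\mathcal{U}) \\
  &= H(\mathcal{E}) - H(\mathcal{R}\mid\mathcal{U}) + H(\mathcal{R}\mid\mathcal{E},\mathcal{U}) \\
  &= H(\mathcal{E}) - R + 0
     && \text{first step: } H(\mathcal{R}\mid\mathcal{E},\mathcal{U}) = 0 \\
  &\le R - R = 0
     && \text{second step: } H(\mathcal{E}) \le R .
\end{align*}
```

Since mutual information is non-negative, this forces $I(\mathcal{U}; \mathcal{E}) = 0$.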
VII. MSR Codes Secure from Active Adversaries

MSR codes use the minimum possible storage at each node. The data-reconstruction property necessitates
that the data from any $k$ nodes suffice to reconstruct all the $B$ message symbols. As a result, each node must
necessarily store at least a fraction $\frac{1}{k}$ of the entire message. Hence for an MSR code we have $\alpha = \frac{B}{k}$. To meet
the bound (1) with equality (in the absence of errors/erasures), an MSR code must satisfy

$$B = k\alpha \qquad (24a)$$

$$d\beta = \alpha + (k-1)\beta . \qquad (24b)$$


In this section we present explicit constructions of optimal MSR codes for all parameter values $[n, k, d \ge 2k-2]$
providing on-demand security from active adversaries. We employ the striping procedure described in
Section IV and without loss of generality construct optimal codes for $\beta = 1$.

Since our codes are required to provide on-demand security, the encoding procedure is independent of the 
levels of security desired and is identical to the case when no security is required. 

We first construct the code for the case $d = 2k-2$ and then use the shortening technique of [10], [11]
to obtain codes for all $d > 2k-2$. As we will see subsequently, the decoding algorithms and the security
properties of the code for $d > 2k-2$ follow directly from the corresponding properties of the $d = 2k-2$ code.


A. Encoding algorithm for d = 2k - 2


In the MSR regime with $d = 2k-2$ and $\beta = 1$, from (24) we have

$$k = \alpha + 1 , \qquad (25a)$$

$$d = 2\alpha , \qquad (25b)$$

$$B = \alpha(\alpha+1) . \qquad (25c)$$


The product-matrix MSR code is described by an $(n \times \alpha)$ code matrix $C$ of the form $C = \Psi M$, with an
$(n \times d)$ encoding matrix $\Psi$ and a $(d \times \alpha)$ message matrix $M$. The MSR code differs from the product-matrix
MBR code in the specific design of the matrices $\Psi$ and $M$.

Under the MSR regime, the $(n \times d)$ encoding matrix $\Psi$ is of the form

$$\Psi = [\Phi \;\; \Lambda\Phi] , \qquad (26)$$

where $\Phi$ is an $(n \times \alpha)$ matrix and $\Lambda$ is an $(n \times n)$ diagonal matrix. The matrices $\Phi$ and $\Lambda$ are chosen such
that: (i) any $\alpha$ rows of $\Phi$ are linearly independent, (ii) any $d$ rows of $\Psi$ are linearly independent, and (iii) the
diagonal elements of $\Lambda$ are all distinct. These requirements can be met, for example, by choosing $\Psi$ to be a
Vandermonde matrix with elements chosen carefully to satisfy the third condition: the $i$-th ($1 \le i \le n$) row of
$\Psi$, $\psi_i^T = [1 \;\; x_i \;\; x_i^2 \;\; \cdots \;\; x_i^{d-1}]$, is chosen such that $x_{i_1}^{\alpha} \ne x_{i_2}^{\alpha}$ for all $1 \le i_1 < i_2 \le n$. Observe that these
properties make $\Phi$ the generator matrix of an $[n, \alpha]$ MDS code, and $\Psi$ the generator matrix of an $[n, d]$ MDS code.

The choice of the matrix $\Psi$ poses the only restriction on the choice of the finite field $\mathbb{F}_q$. For instance,
choosing $\Psi$ as a Vandermonde matrix (in the manner described above) permits any $q \ge n^2$.
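
The Vandermonde condition above is easy to test for a candidate set of evaluation points. The sketch below is illustrative only: the helper name is hypothetical and it assumes a prime field size $q$, so that arithmetic is ordinary modular arithmetic. Distinct $x_i$ give conditions (i) and (ii) through the Vandermonde structure, and distinct $\lambda_i = x_i^{\alpha}$ give condition (iii).

```python
def valid_evaluation_points(xs, alpha, q):
    """Check that the x_i are distinct and that the x_i^alpha are distinct in F_q (q prime)."""
    xs = [x % q for x in xs]
    lambdas = [pow(x, alpha, q) for x in xs]   # diagonal entries of Lambda
    return len(set(xs)) == len(xs) and len(set(lambdas)) == len(lambdas)

# The points used later in Example 3 (alpha = 2, field F_13) pass the check.
assert valid_evaluation_points([0, 1, 3, 2, 6, 5, 4], alpha=2, q=13)
# The points 1 and 12 fail it, because 1^2 = 12^2 = 1 in F_13.
assert not valid_evaluation_points([1, 12], alpha=2, q=13)
```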


We will now specify the design of the message matrix $M$. From (25b) and (25c), we have $d = 2\alpha$ and
$B = \alpha(\alpha+1)$ respectively. The $(d \times \alpha)$ message matrix $M$ is constructed as

$$M = \begin{bmatrix} S_1 \\ S_2 \end{bmatrix} , \qquad (27)$$

where $S_1$ and $S_2$ are $(\alpha \times \alpha)$ symmetric matrices. The matrices $S_1$ and $S_2$ together have precisely $\alpha(\alpha+1)$
distinct entries, which are now populated by the $B = \alpha(\alpha+1)$ message symbols. Thus, under this encoding
mechanism, node $i$ ($1 \le i \le n$) stores the $\alpha$ symbols

$$\psi_i^T M = \phi_i^T S_1 + \lambda_i \phi_i^T S_2 .$$


          First symbol               Second symbol
Node 1:   a                          b
Node 2:   a + b + d + e              b + c + e + f
Node 3:   a + 3b + 9d + e            b + 3c + 9e + f
Node 4:   a + 2b + 4d + 8e           b + 2c + 4e + 8f
Node 5:   a + 6b + 10d + 8e          b + 6c + 10e + 8f
Node 6:   a + 5b + 12d + 8e          b + 5c + 12e + 8f
Node 7:   a + 4b + 3d + 12e          b + 4c + 3e + 12f

Fig. 6: An MSR code for $[n = 7, k = 3, d = 4]$. The message is $\{a, b, c, d, e, f\}$ and the finite field of operation
is $\mathbb{F}_{13}$. In the absence of security requirements ($p = 0$), repair of any node $f$ ($1 \le f \le 7$) is performed by
connecting to any $d = 4$ nodes and downloading the inner product of the two symbols in each node with
$[1 \;\; x_f]$, where $x_f$ is the $f$-th element of $x = [0 \; 1 \; 3 \; 2 \; 6 \; 5 \; 4]$. The repair of any node $f$ ($1 \le f \le 7$), with
security from compromise of any $p = 1$ node, is performed by connecting to any $(d + 2p) = 6$ nodes and
downloading the inner product of the two symbols in each node with $[1 \;\; x_f]$. The code is optimal: it requires
the minimum possible storage (it is MDS), and for this storage, it requires the minimum possible amount of
download in each of the aforementioned scenarios.


This encoding procedure is illustrated below with an example. This example will subsequently be reused to 
illustrate secure node-repair. 

Example 3: Let us construct an MSR code for $[n = 7, k = 3, d = 4]$. With $\beta = 1$, the associated parameters
are $(\alpha = 2, B = 6)$. Note that the chosen parameters $k$ and $d$ satisfy the relation $d = 2k - 2$. Let us operate
over the finite field $\mathbb{F}_{13}$, and choose the encoding matrix $\Psi$ to be a Vandermonde matrix. Further, let us denote
the 6 message symbols as $\{a, b, c, d, e, f\}$. The matrices $\Psi$ and $M$ take values

$$\Psi = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 3 & 9 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 6 & 10 & 8 \\ 1 & 5 & 12 & 8 \\ 1 & 4 & 3 & 12 \end{bmatrix} , \qquad
M = \begin{bmatrix} a & b \\ b & c \\ d & e \\ e & f \end{bmatrix} .$$

The code is given by $C = \Psi M$. The resultant code is depicted in Fig. 6.
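
The encoding of Example 3 can be reproduced numerically. The sketch below is illustrative: the function names are hypothetical, the evaluation points are taken from the caption of Fig. 6, and the message symbols are assumed to be integers in $\{0, \ldots, 12\}$ representing elements of $\mathbb{F}_{13}$.

```python
Q = 13                       # field size: F_13
X = [0, 1, 3, 2, 6, 5, 4]    # evaluation points x_i for nodes 1..7 (caption of Fig. 6)
ALPHA, D = 2, 4              # alpha = 2 symbols per node, d = 2 * alpha

def psi_row(x):
    """Vandermonde encoding vector [1, x, x^2, x^3] over F_13."""
    return [pow(x, j, Q) for j in range(D)]

def encode(msg):
    """Return the (n x alpha) code matrix C = Psi * M for the message (a, b, c, d, e, f)."""
    a, b, c, d, e, f = msg
    M = [[a, b], [b, c],     # S1 = [[a, b], [b, c]]
         [d, e], [e, f]]     # S2 = [[d, e], [e, f]]
    return [[sum(w * M[t][j] for t, w in enumerate(psi_row(x))) % Q
             for j in range(ALPHA)]
            for x in X]

nodes = encode([1, 2, 3, 4, 5, 6])
assert nodes[0] == [1, 2]      # node 1 stores (a, b)
assert nodes[1] == [12, 3]     # node 2 stores (a+b+d+e, b+c+e+f) mod 13
```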
Remark 3 (Systematic code): Although the encoding mechanism described above does not directly result in
a systematic code, it can be converted easily to systematic form by following a procedure that will be described
subsequently in Section VII-D.

This completes the description of the encoding procedure. We now move on to present decoding algorithms 
for node-repair and data-reconstruction with on-demand security. 


B. Algorithm for Secure Node Repair 

The following theorem presents an explicit algorithm for node-repair that ensures security in the presence of 
compromise of nodes to an active adversary. 

Theorem 5 (Secure node-repair): In the code presented, for any desired value of $p$, the repair of a failed
node can be secured from the compromise of up to $p$ nodes, by letting the replacement node download $\beta = 1$
symbol each from $(d + 2p)$ arbitrary nodes. The amount of data downloaded in this process is minimum, thus
establishing the secrecy capacity of such systems. The choice of parameter $p$ is arbitrary, and can be different
for every instance of repair, thus providing on-demand security.

Proof: Suppose during any instance of node-repair, it is desired to provide protection from the compromise
of up to $p$ nodes, for some choice of $p$. Let $f$ denote the failed node and $\psi_f^T = [\phi_f^T \;\; \lambda_f \phi_f^T]$ be its encoding
vector. The $\alpha$ symbols stored in the failed node are

$$\psi_f^T M = \phi_f^T S_1 + \lambda_f \phi_f^T S_2 . \qquad (28)$$

The replacement node connects to some $(d + 2p)$ nodes. The replacement node is required to recover these $\alpha$
symbols by downloading $\beta = 1$ symbol each from the $(d + 2p)$ nodes that it connects to, when at most $p$ of
these nodes are compromised. Let $\mathcal{J}$ denote the set of these $(d + 2p)$ nodes. Under our repair algorithm, each
of these $(d + 2p)$ nodes passes the inner product of the $\alpha$ values stored in it with the vector $\phi_f$ associated with
the failed node $f$. That is, for every $j \in \mathcal{J}$, node $j$ passes the value of

$$\psi_j^T M \phi_f .$$

This value may be corrupt if node $j$ is compromised by the adversary.

Let $\Psi_{\mathcal{J}}$ denote the $((d + 2p) \times d)$ submatrix of $\Psi$, with the rows of $\Psi_{\mathcal{J}}$ comprising the $(d + 2p)$ rows of $\Psi$
that correspond to the $(d + 2p)$ nodes in $\mathcal{J}$. Then the replacement node has access to the $(d + 2p)$ values

$$\Psi_{\mathcal{J}} M \phi_f .$$

To simplify notation, let us define a vector $\mu$ as

$$\mu = M \phi_f . \qquad (29)$$

In terms of this notation, the $(d + 2p)$ encoded symbols downloaded by the replacement node can be written
as $\Psi_{\mathcal{J}} \mu$. To decode in the presence of possible corruption, we will exploit the MDS property of the matrix $\Psi$.
By construction, $\Psi$ is the generator matrix of an $[n, d]$ MDS code. Hence, $\Psi_{\mathcal{J}}$ is the generator matrix of a
$[d + 2p, d]$ MDS code. This implies that the code generated by $\Psi_{\mathcal{J}}$ has a minimum distance of $(2p + 1)$. The
$(d + 2p)$ symbols $\Psi_{\mathcal{J}} \mu$ downloaded by the replacement node are simply an encoding of $\mu$ under a code that has
$\Psi_{\mathcal{J}}$ as its generator matrix. Hence, the replacement node can recover $\mu$ correctly by decoding the MDS code
$\Psi_{\mathcal{J}}$ in the presence of up to $p$ corruptions. Thus, even in the presence of up to $p$ compromised nodes in the
system, the replacement node correctly recovers

$$\mu = M \phi_f = \begin{bmatrix} S_1 \\ S_2 \end{bmatrix} \phi_f = \begin{bmatrix} S_1 \phi_f \\ S_2 \phi_f \end{bmatrix} . \qquad (30)$$

Thus the replacement node has access to the correct values of $S_1 \phi_f$ and $S_2 \phi_f$. The symmetry of $S_1$ and $S_2$
allows for the computation of

$$(S_1 \phi_f)^T + \lambda_f (S_2 \phi_f)^T = \phi_f^T S_1 + \lambda_f \phi_f^T S_2 \qquad (31)$$

correctly, and this is precisely the collection of $\alpha$ symbols that were stored in the failed node. ■

The following example illustrates this repair algorithm.

Example 4: Consider the code described in Example 3 (Fig. 6) for parameters $[n = 7, k = 3, d = 4]$.
Suppose node 2 fails and needs to be repaired with security from the compromise of any $p = 1$ other node.
From the construction described in Example 3 we see that $\phi_2^T = [1 \;\; 1]$. Under the secure repair algorithm
described above, the replacement node connects to the $d + 2p = 6$ remaining nodes. Each of these nodes
takes an inner product of its data with the vector $[1 \;\; 1]$, i.e., passes the sum of the two symbols it stores. The data
thereby obtained by the replacement node is a codeword of a $[6, 4]$ MDS code with the message $\{a + b, b + c, d + e, e + f\}$;
the six symbols obtained by the replacement node are

• a + b (= (a + b) + 0(b + c) + 0(d + e) + 0(e + f))

• a + 3b + 9d + e + b + 3c + 9e + f (= (a + b) + 3(b + c) + 9(d + e) + (e + f))

• a + 2b + 4d + 8e + b + 2c + 4e + 8f (= (a + b) + 2(b + c) + 4(d + e) + 8(e + f))

• a + 6b + 10d + 8e + b + 6c + 10e + 8f (= (a + b) + 6(b + c) + 10(d + e) + 8(e + f))

• a + 5b + 12d + 8e + b + 5c + 12e + 8f (= (a + b) + 5(b + c) + 12(d + e) + 8(e + f))

• a + 4b + 3d + 12e + b + 4c + 3e + 12f (= (a + b) + 4(b + c) + 3(d + e) + 12(e + f))

The $[6, 4]$ MDS property allows for correction of one arbitrary error. It follows that the replacement node can
correctly recover $\{a + b, b + c, d + e, e + f\}$ and compute its desired data $(a + b + d + e, \; b + c + e + f)$ even
in the presence of an adversary who may gain control of one arbitrary node.
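
The repair of Example 4 can be simulated end to end. The sketch below is an illustration under stated assumptions, not the decoder of the paper: the MDS decoding step is done by brute force (solve for the message on every 4-subset of helpers and keep the candidate that agrees with all but at most $p$ of the received symbols), which is adequate only for tiny parameters. All function names are hypothetical, and nodes are indexed from 0.

```python
from itertools import combinations

Q = 13                       # field size: F_13
X = [0, 1, 3, 2, 6, 5, 4]    # evaluation points of nodes 1..7 (caption of Fig. 6)
ALPHA, D = 2, 4

def psi(x):
    """Encoding vector [1, x, x^2, x^3]; its first alpha entries form phi."""
    return [pow(x, j, Q) for j in range(D)]

def encode(msg):
    """Node contents Psi * M with S1 = [[a, b], [b, c]] and S2 = [[d, e], [e, f]]."""
    a, b, c, d, e, f = msg
    M = [[a, b], [b, c], [d, e], [e, f]]
    return [[sum(w * M[t][j] for t, w in enumerate(psi(x))) % Q for j in range(ALPHA)]
            for x in X]

def solve_mod(A, y):
    """Solve the square system A z = y over F_13 by Gauss-Jordan elimination."""
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, y)]
    for col in range(n):
        piv = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[piv] = aug[piv], aug[col]
        inv = pow(aug[col][col], Q - 2, Q)           # inverse via Fermat's little theorem
        aug[col] = [v * inv % Q for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(v - factor * w) % Q for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def helper_symbol(stored, failed):
    """Symbol sent by a helper: inner product of its data with phi_f = [1, x_f]."""
    phi_f = psi(X[failed])[:ALPHA]
    return sum(s * v for s, v in zip(stored, phi_f)) % Q

def repair(received, helpers, failed, p):
    """Recover the failed node's data from d + 2p symbols, at most p of them corrupt."""
    rows = [psi(X[h]) for h in helpers]
    for subset in combinations(range(len(helpers)), D):
        mu = solve_mod([rows[s] for s in subset], [received[s] for s in subset])
        agree = sum(sum(w * m for w, m in zip(row, mu)) % Q == y
                    for row, y in zip(rows, received))
        if agree >= len(helpers) - p:                # unique by minimum distance 2p + 1
            lam = pow(X[failed], ALPHA, Q)           # lambda_f = x_f^alpha
            return [(u + lam * v) % Q for u, v in zip(mu[:ALPHA], mu[ALPHA:])]
    raise ValueError("no consistent codeword: more than p symbols corrupted")

nodes = encode([1, 2, 3, 4, 5, 6])
failed, helpers = 1, [0, 2, 3, 4, 5, 6]              # node 2 fails; six helpers
received = [helper_symbol(nodes[h], failed) for h in helpers]
received[3] = (received[3] + 7) % Q                  # one compromised helper lies
assert repair(received, helpers, failed, p=1) == nodes[failed]
```

A practical implementation would replace the subset search with an algebraic Reed-Solomon decoder, since the number of subsets grows quickly with $d$ and $p$.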


C. Algorithm for Secure Data Reconstruction 

The following theorem presents an explicit algorithm for data-reconstruction that ensures security in the 
presence of nodes compromised to an active adversary. 

Theorem 6 (Secure data-reconstruction): In the code presented, a data-collector can recover all the $B$ message
symbols by downloading data stored in $(k + 2p)$ arbitrary nodes, in the presence of up to $p$ compromised
nodes. The choice of parameter $p$ is arbitrary, and can be different for every instance of data-reconstruction,
thus providing on-demand security.

Proof: Consider a data-reconstruction operation that requires security from up to $p$ compromised nodes.
Then, under our protocol for data-reconstruction, a data-collector connects to $(k + 2p)$ arbitrary nodes in the
system and downloads the $\alpha$ symbols stored in each of these nodes. Let $\mathcal{I}$ denote this set of $(k + 2p)$ nodes.
Of these, let $\mathcal{A}$ denote the subset of nodes compromised by the adversary. To prove the correctness of our
reconstruction algorithm for the desired protection-level $p$, we assume that the size of $\mathcal{A}$ is no larger than $p$.
Of course, the set $\mathcal{A}$ is not known to the decoding algorithm, and is used here only to illustrate the algorithm.


[Fig. 7, panels (a)-(f); label: $P + \Lambda_{\mathcal{I}} Q$; legend: in error, correct, unknown.]

Fig. 7: An illustration of the error patterns arising during the reconstruction process under the product-matrix
MSR code providing security from active adversaries. The pattern shown in part (a) depicts the corruption of
some $p$ rows (corresponding to the set $\mathcal{A}$) of the received data. The remaining parts (b)-(f) illustrate the resulting
error patterns at various steps of the decoding algorithm. Note that the $p$ corrupt rows are shown as contiguous
only for ease of illustration.

Let $\Psi_{\mathcal{I}}$ denote the $((k + 2p) \times d)$ submatrix of $\Psi$ whose rows are the rows of $\Psi$ indexed by the set $\mathcal{I}$. From
the structure of the encoding matrix $\Psi$ in (26), we can write $\Psi_{\mathcal{I}}$ as

$$\Psi_{\mathcal{I}} = [\Phi_{\mathcal{I}} \;\; \Lambda_{\mathcal{I}} \Phi_{\mathcal{I}}] .$$

Then the data collector has access to the encoded symbols

$$\Psi_{\mathcal{I}} M = [\Phi_{\mathcal{I}} \;\; \Lambda_{\mathcal{I}} \Phi_{\mathcal{I}}] \begin{bmatrix} S_1 \\ S_2 \end{bmatrix} = \Phi_{\mathcal{I}} S_1 + \Lambda_{\mathcal{I}} \Phi_{\mathcal{I}} S_2 , \qquad (32)$$

of which the $p$ rows corresponding to $\mathcal{A}$ can be corrupt. An example of such an error pattern is illustrated in
Fig. 7a.

The data-collector multiplies this term by $\Phi_{\mathcal{I}}^T$ on the right to obtain

$$\Psi_{\mathcal{I}} M \Phi_{\mathcal{I}}^T = \Phi_{\mathcal{I}} S_1 \Phi_{\mathcal{I}}^T + \Lambda_{\mathcal{I}} \Phi_{\mathcal{I}} S_2 \Phi_{\mathcal{I}}^T . \qquad (33)$$

Note that since this operation involves only column operations, the error patterns do not change, and only the
$p$ rows corresponding to $\mathcal{A}$ continue to be corrupt. Next, define two $((k + 2p) \times (k + 2p))$ matrices $P$ and $Q$ as

$$P = \Phi_{\mathcal{I}} S_1 \Phi_{\mathcal{I}}^T , \qquad (34)$$

$$Q = \Phi_{\mathcal{I}} S_2 \Phi_{\mathcal{I}}^T . \qquad (35)$$

Since $S_1$ and $S_2$ are symmetric, it follows that the matrices $P$ and $Q$ are also symmetric. One can rewrite the
data available to the data-collector in terms of $P$ and $Q$ as

$$P + \Lambda_{\mathcal{I}} Q ,$$

of which the $p$ rows corresponding to $\mathcal{A}$ can be corrupt (Fig. 7b). Now, for any $i$ and $j$ such that $1 \le i, j \le k + 2p$
and $i \ne j$, the $(i, j)$-th element of the matrix $P + \Lambda_{\mathcal{I}} Q$ is

$$P_{ij} + \lambda_i Q_{ij} , \qquad (36)$$

and its $(j, i)$-th element is

$$P_{ji} + \lambda_j Q_{ji} = P_{ij} + \lambda_j Q_{ij} , \qquad (37)$$

where (37) follows from the symmetry of matrices $P$ and $Q$. By construction, $\lambda_i \ne \lambda_j$ whenever $i \ne j$, and
hence using (36) and (37), the data-collector solves for the values of $P_{ij}$ and $Q_{ij}$ for all $i \ne j$ (in other words,
the data-collector solves for all non-diagonal elements of $P$ and $Q$). However, some of these computed values
may be in error due to the corruption of $p$ of the rows in $P + \Lambda_{\mathcal{I}} Q$. In particular, the $p$ rows (and hence the
$p$ columns) corresponding to $\mathcal{A}$ may be in error in $P$ as well as in $Q$. These error patterns in $P$ and $Q$ are
illustrated in Fig. 7c.
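
The step that separates $P$ and $Q$ using (36) and (37) amounts to solving a two-by-two linear system for each pair of indices. A minimal sketch follows, assuming a prime field (here $\mathbb{F}_{13}$) and using hypothetical names; the diagonal entries are left undetermined, as in the text.

```python
FIELD = 13  # prime field size, F_13

def split_offdiagonal(Y, lam):
    """Given Y = P + diag(lam) * Q over F_13 with P, Q symmetric and the lam[i]
    distinct, return the off-diagonal entries of P and Q (diagonals stay None)."""
    n = len(Y)
    P = [[None] * n for _ in range(n)]
    Qm = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # (36): Y[i][j] = P_ij + lam_i * Q_ij ;  (37): Y[j][i] = P_ij + lam_j * Q_ij
            inv = pow((lam[i] - lam[j]) % FIELD, FIELD - 2, FIELD)
            q = (Y[i][j] - Y[j][i]) * inv % FIELD
            p = (Y[i][j] - lam[i] * q) % FIELD
            P[i][j] = P[j][i] = p
            Qm[i][j] = Qm[j][i] = q
    return P, Qm

# Small self-check with arbitrary symmetric matrices and distinct lambda values.
P0 = [[1, 2, 3], [2, 4, 5], [3, 5, 6]]
Q0 = [[7, 8, 9], [8, 10, 11], [9, 11, 12]]
lam = [0, 1, 9]
Y = [[(P0[i][j] + lam[i] * Q0[i][j]) % FIELD for j in range(3)] for i in range(3)]
P, Qm = split_offdiagonal(Y, lam)
assert all(P[i][j] == P0[i][j] and Qm[i][j] == Q0[i][j]
           for i in range(3) for j in range(3) if i != j)
```

If row $i$ or row $j$ of $Y$ is corrupt, the computed pair $(P_{ij}, Q_{ij})$ may be wrong, which is exactly the row-and-column error pattern described above.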


We will now use the (possibly corrupt) values of the non-diagonal elements of matrices $P$ and $Q$ to correctly
decode $S_1$ and $S_2$ respectively. Let us first consider recovery of $S_1$, for which we will use the values of the
non-diagonal elements of $P$ obtained above. Consider any column $j$ ($1 \le j \le k + 2p$) of the matrix $P$, and let
$\mathcal{J} = \mathcal{I} \setminus \{j\}$. The $j$-th column of the matrix $P$, excluding the element on its diagonal, is

$$\Phi_{\mathcal{J}} S_1 \phi_j ,$$

and is of length $(k + 2p - 1) = (\alpha + 2p)$. Recall that by construction, $\Phi$ is the generator matrix of an $[n, \alpha]$
MDS code. Hence its submatrix $\Phi_{\mathcal{J}}$ is the generator matrix of an $[\alpha + 2p, \alpha]$ MDS code. It follows that this code
has a minimum distance of $(2p + 1)$. The symbols $\Phi_{\mathcal{J}} S_1 \phi_j$ can be considered as an encoding of the vector
$S_1 \phi_j$ using the generator matrix $\Phi_{\mathcal{J}}$. Using the MDS property of $\Phi_{\mathcal{J}}$, the algorithm attempts to decode $S_1 \phi_j$
assuming the presence of no more than $p$ errors. Recall from the discussion above that the vector $\Phi_{\mathcal{J}} S_1 \phi_j$ has
at most $p$ corrupt entries if $j \notin \mathcal{A}$, but it can be entirely corrupt if $j \in \mathcal{A}$ (see Fig. 7d). As a result, the
data-collector obtains the correct value of $S_1 \phi_j$ if $j \notin \mathcal{A}$, and a possibly corrupt value if $j \in \mathcal{A}$, as shown in
Fig. 7e (of course, the data-collector still does not know $\mathcal{A}$ and hence does not know if the value obtained is
correct or not). However, when the size of $\mathcal{A}$ is at most $p$, at most $p$ of these decoded vectors $\{S_1 \phi_j\}_{j=1}^{\alpha + 2p + 1}$
may be in error. Now, these vectors comprise the columns of the matrix $S_1 \Phi_{\mathcal{I}}^T$, as seen in Fig. 7f. Again, since
$\Phi_{\mathcal{I}}$ is the generator matrix of an $[\alpha + 2p + 1, \alpha]$ MDS code, one can decode the $(\alpha \times \alpha)$ matrix $S_1$ from $S_1 \Phi_{\mathcal{I}}^T$
in the presence of up to $p$ erroneous columns. Thus, the data-collector correctly recovers $S_1$.

Following steps identical to the recovery of $S_1$ from $P$, the data-collector also correctly decodes matrix $S_2$ from
the non-diagonal elements of $Q$. In this manner, the data collector correctly recovers all the message symbols
in the presence of at most $p$ compromised nodes. ■


D. Conversion to systematic form 

Any MSR code where failed nodes are replaced by their exact replicas can be converted to systematic form
through a remapping of the source symbols. An explicit algorithm to convert the product-matrix MSR codes
was provided in GD* and is briefly reproduced here for subsequent use. Let $\Psi_k$ denote the $(k \times d)$ submatrix
of matrix $\Psi$ with its $k$ rows comprising the first $k$ rows of $\Psi$. Let $U$ be a $(k \times \alpha)$ matrix consisting of the
$B$ ($= k\alpha$) message symbols as its $k\alpha$ entries. Now, choose the message matrix $M$ to satisfy

$$\Psi_k M = U , \qquad (38)$$

while maintaining the structure of $M$ as given by (27).

The value of $M$ in (38) can be obtained by using the decoding algorithm for data reconstruction (for $p = 0$).
With this $M$ as the message matrix, under the encoding algorithm of Section VII-A, the first $k$ nodes will
store the data $\Psi_k M$. This is precisely the matrix of uncoded message symbols $U$, thus making these nodes
systematic. Note that following this remapping, the node repair algorithm remains the same. Data reconstruction
from the $k$ systematic nodes requires no computation, while decoding from parity nodes requires the additional
step of recovering $U$ from $M$ by pre-multiplication with $\Psi_k$ as in (38).


Example 5: Fig. 8 depicts an example of a systematic product-matrix MSR code providing (on-demand)
security from malicious adversaries. This code is obtained by converting the code of Example 3 (Fig. 6) into
a systematic form by remapping the source symbols.


E. Shortening for d > 2k — 2 


A shortening technique under the MSR regime for constructing low-redundancy codes from
high-redundancy codes was introduced in [10], [13]. This technique is employed in CD to extend the $d = 2k - 2$
product-matrix MSR code to the parameter range of $d > 2k - 2$. In this section we construct secure MSR codes
for $d > 2k - 2$ that meet (24) by applying this shortening technique to the secure MSR codes of Section VII-A.

Since the codes are required to provide on-demand security, the shortening procedure to obtain secure codes
for $d > 2k - 2$ is independent of the desired security levels, and is identical to the corresponding procedure in
the absence of security dD Section V-C]. We briefly describe this procedure and show that these codes provide
optimal security from active adversaries.

Fig. 8: A systematic MSR code for $(n = 7, k = 3, d = 4)$. The message is $\{a, b, c, d, e, f\}$ and the finite
field of operation is $\mathbb{F}_{13}$. In the absence of security requirements ($p = 0$), repair of any node $f$ ($1 \le f \le 7$) is
performed by connecting to any $d = 4$ nodes and downloading the inner product of the two symbols in each
node with $[1 \;\; x_f]$, where $x_f$ is the $f$-th element of $x = [0 \; 1 \; 3 \; 2 \; 6 \; 5 \; 4]$. The repair of any node $f$ ($1 \le f \le 7$),
with security from compromise of any $p = 1$ node, is performed by connecting to any $(d + 2p) = 6$ nodes and
downloading the inner product of the two symbols in each node with $[1 \;\; x_f]$. The code is optimal: it requires
the minimum possible storage (it is MDS), and for this storage, it requires the minimum possible amount of
download in each of the aforementioned scenarios.

Suppose for some positive integer $i$, one wishes to construct a product-matrix MSR code with parameters
$[n, k, d = 2k - 2 + i]$. Moreover, it is desired to construct an optimal code, i.e., a code whose parameters
$(B, \alpha, \beta)$ satisfy (24). As the first step in this task, we construct a product-matrix MSR code for the parameters
$[n' = n + i, \; k' = k + i, \; d' = d + i = 2k - 2 + 2i]$. Note that this parameter set satisfies $d' = 2k' - 2$, and one can
employ the encoding procedure described in Section VII-A for this construction. Next, this code is converted
to a systematic form as described in Section VII-D. Let us denote the resultant systematic code by $\mathcal{C}'$. In order
to obtain the desired $[n, k, d = 2k - 2 + i]$ code $\mathcal{C}$, we observe from (24) that the number of message symbols
$B'$ in code $\mathcal{C}'$ is $(k + i)(k - 1 + i)$, and the number of message symbols $B$ in code $\mathcal{C}$ is $k(k - 1 + i)$. Taking a
cue from this expression, to obtain code $\mathcal{C}$ from code $\mathcal{C}'$, we set the $i(k - 1 + i)$ message symbols stored in the
first $i$ (systematic) nodes in code $\mathcal{C}'$ to zero. The data stored in the remaining $n$ nodes, which is an encoding
of the remaining $B = k(k - 1 + i)$ message symbols, constitutes the desired $[n, k, d = 2k - 2 + i]$ code $\mathcal{C}$.
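
The parameter bookkeeping of the shortening step can be written out as a short calculation. The sketch below is illustrative only (the function name and the returned dictionary are hypothetical); it takes $\beta = 1$ and uses $\alpha = d - k + 1 = k - 1 + i$, which is the same for the long code and the shortened code.

```python
def shortening_parameters(n, k, i):
    """Parameters of the [n, k, d = 2k - 2 + i] code and of the longer code it is cut from."""
    d = 2 * k - 2 + i
    n_long, k_long, d_long = n + i, k + i, d + i
    alpha = k - 1 + i                 # symbols per node, identical in both codes
    B_long = k_long * alpha           # (k + i)(k - 1 + i) message symbols in the long code
    B = k * alpha                     # k(k - 1 + i) message symbols in the target code
    zeroed = i * alpha                # symbols held by the first i systematic nodes
    assert d_long == 2 * k_long - 2 and B_long - zeroed == B
    return {"d": d, "long": (n_long, k_long, d_long), "alpha": alpha,
            "B_long": B_long, "B": B, "zeroed": zeroed}

# Going from the [7, 3, 4] code to a [6, 2, 3] code (i = 1) zeroes two message symbols.
params = shortening_parameters(n=6, k=2, i=1)
assert params["long"] == (7, 3, 4) and params["B_long"] == 6
assert params["B"] == 4 and params["zeroed"] == 2
```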

Example 6: Fig. 9 depicts an example of a product-matrix MSR code providing (on-demand) security from
malicious adversaries for a parameter set satisfying $d > 2k - 2$. This code is obtained by shortening the code of
Example 5 (Fig. 8) by eliminating message symbols $\{a, b\}$ from the code of Example 5. The message symbols
$\{c, d, e, f\}$ in the code of Example 5 (Fig. 8) correspond to $\{a, b, c, d\}$ respectively in Fig. 9.

The following theorem provides explicit algorithms for secure reconstruction and repair for this
code.

Theorem 7 (Secure data-reconstruction and node-repair): In the code presented, a data-collector can recover
all the $B$ message symbols by downloading data stored in $(k + 2p)$ arbitrary nodes, and a replacement node can
recover all the data stored in the failed node by downloading $\beta = 1$ symbol each from $(d + 2p)$ arbitrary nodes,
in the presence of up to $p$ compromised nodes. The choice of parameter $p$ is arbitrary, and can be different for
every instance of data-reconstruction and node-repair, thus providing on-demand security.

Proof: To see how reconstruction and repair are performed in code $\mathcal{C}$, we can pretend to operate under
code $\mathcal{C}'$, and assume that a user (or replacement node) always connects to the first $i$ nodes in addition to the
$k$ (or $d$) nodes that it chooses in $\mathcal{C}$. The data in the first $i$ nodes is known to be all zero, and hence cannot be
erroneous. Thus, reconstruction or repair in $\mathcal{C}$ with a security level of $p$ is identical to that in $\mathcal{C}'$ with the same
security level $p$. It follows that the reconstruction and repair operations in code $\mathcal{C}$ are identical to those in $\mathcal{C}'$.


Optimality is a result of the parameters of $\mathcal{C}$ satisfying the bound (24).

Fig. 9: A systematic MSR code for $(n = 6, k = 2, d = 3)$. The message is $\{a, b, c, d\}$ and the finite field
of operation is $\mathbb{F}_{13}$. In the absence of security requirements ($p = 0$), repair of any node $f$ ($1 \le f \le 6$) is
performed by connecting to any $d = 3$ nodes and downloading the inner product of the two symbols in each
node with $[1 \;\; x_f]$, where $x_f$ is the $f$-th element of $x = [0 \; 1 \; 2 \; 6 \; 5 \; 4]$. The repair of any node $f$ ($1 \le f \le 6$),
with security from compromise of any $p = 1$ node, is performed by connecting to any $(d + 2p) = 5$ nodes and
downloading the inner product of the two symbols in each node with $[1 \;\; x_f]$. The code is optimal: it requires
the minimum possible storage (it is MDS), and for this storage, it requires the minimum possible amount of
download in each of the aforementioned scenarios.


VIII. MSR codes for Security from Passive Eavesdroppers 


Recall from (24) that in the absence of secrecy requirements, the MSR regime has

$$d\beta = \alpha + (k - 1)\beta . \qquad (39)$$

Unlike the MSR codes for security from active adversaries, the codes for security from passive eavesdroppers
do not provide on-demand security. Thus, one needs to fix the desired level of security $\{\ell, m\}$ at the time of
encoding.

In this section we present explicit constructions of MSR codes for all parameter values $[n, k, d \ge 2k - 2]$
and all $\{\ell, m\}$ providing information-theoretic security from passive eavesdroppers. The $\{\ell, m\}$-secure MSR
codes constructed in this paper achieve

$$B^* = (k - \ell)(\alpha - m\beta) = (k - \ell)(d - k + 1 - m)\beta . \qquad (40)$$

Note that (40) meets the lower bound for $m \le 1$, making our codes optimal for this regime and also
establishing the secure capacity of this parameter regime.

For a specified set of parameters $[n, k, d]$, let $R$ denote the difference between the number of message symbols
$B$ that can be stored in a system without security, and the number $B^*$ that can be stored in the $\{\ell, m\}$-secure
system constructed here. From (24) and (40), the value of $R$ is given by

$$R = B - B^* = m(k - \ell)\beta + \ell\alpha . \qquad (41)$$
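
The quantities in (39)-(41) can be tabulated directly. The following is a small illustrative helper (its name and return format are assumptions, not part of the paper) that computes $\alpha$, $B$, $B^*$ and $R$ for given $k$, $d$, $\ell$, $m$ and $\beta$, and checks that (41) agrees with the difference of (24a) and (40).

```python
def secure_msr_sizes(k, d, ell, m, beta=1):
    """Return (alpha, B, B_secure, R) for an {ell, m}-secure MSR code."""
    alpha = (d - k + 1) * beta                     # from (39)
    B = k * alpha                                  # (24a): size without security
    B_secure = (k - ell) * (alpha - m * beta)      # (40): secure message size
    R = m * (k - ell) * beta + ell * alpha         # (41): number of random symbols
    assert R == B - B_secure
    return alpha, B, B_secure, R

# k = 3, d = 4, ell = 1, m = 0: four of the six symbols can carry the message.
assert secure_msr_sizes(3, 4, 1, 0) == (2, 6, 4, 2)
```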


We now describe the construction of the secure product-matrix MSR code, which is performed by replacing
precisely $R$ message symbols in the code of Section VII-A by $R$ random symbols. As done in Section VII, we will
first consider the case of $d = 2k - 2$, and subsequently extend it to $d > 2k - 2$ via a shortening procedure.

Fig. 10: Structure of the message matrix $M^*$ for an MSR code secure from passive eavesdroppers, when
$d = 2k - 2$. The matrix $M^*$ for the code providing security from passive eavesdroppers is obtained from the
message matrix $M$ of the code without security requirements by filling up a specific set of positions (which
were occupied by message values in $M$) by random values. The two submatrices $S_1^*$ and $S_2^*$ are symmetric.


Observe from (39) and (40) that the parameters $B^*$ and $\alpha$ are multiples of $\beta$. This allows us to perform
striping (explained in Section IV) and construct codes for $\beta = 1$ without loss of generality.


A. Encoding for d = 2k — 2 


An $\{\ell, m\}$-secure product-matrix MSR code for the parameters $[n, k, d]$ is obtained by modifying the
$[n, k, d]$ product-matrix MSR code constructed in Section VII-A. Let us denote the product-matrix MSR code
of Section VII-A by $\mathcal{C}$, and the code with security (which will be constructed below) by $\mathcal{C}^*$.

The code $\mathcal{C}^*$ also belongs to the product-matrix framework, and is described by an $(n \times \alpha)$ code matrix
$C^* = \Psi^* M^*$. The matrices $\Psi^*$ and $M^*$ are obtained by modifying the matrices $\Psi$ and $M$ of Section VII-A
as follows. Choose $\Psi^*$ such that it satisfies the following property in addition to those required for $\Psi$: when
restricted to the first $\ell$ columns, any $\ell$ rows of $\Psi^*$ are linearly independent. The choice of $\Psi^*$ as a Vandermonde
matrix as described in Section VII-A satisfies this additional property.

When $\beta = 1$, the value of $R$ in (41) equals $R = \ell\alpha + (k - \ell)m$. The message matrix $M^*$ for code $\mathcal{C}^*$ is
obtained by replacing a specific set of $R$ message symbols in $M$ with random symbols. Recall the structure of
matrix $M$ from (27). Matrix $M^*$ is obtained from $M$ by replacing: the $\ell\alpha - \binom{\ell}{2}$ symbols in the first $\ell$
rows (and hence the first $\ell$ columns) of the symmetric matrix $S_1$, the $\binom{\ell}{2}$ symbols in the intersection of the
first $(\ell - 1)$ rows and first $(\ell - 1)$ columns of the symmetric matrix $S_2$, and the $(k - \ell)m$ remaining symbols
in the first $m$ rows (and hence the first $m$ columns) of $S_2$. Fig. 10 depicts this procedure.
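
The replacement rule can be checked by counting positions. The sketch below is an illustration under assumptions: each symmetric matrix is represented by its upper triangle (pairs $(i, j)$ with $i \le j$, indexed from 0), the function name is hypothetical, and $m \le \ell$ is assumed. It lists the positions filled with random symbols and confirms that their number equals $R = \ell\alpha + (k - \ell)m$ with $k = \alpha + 1$.

```python
def random_positions(alpha, ell, m):
    """Upper-triangle positions of S1 and S2 that hold random symbols (0-indexed)."""
    upper = [(i, j) for i in range(alpha) for j in range(i, alpha)]
    # S1: everything in the first ell rows (equivalently, the first ell columns).
    s1 = {(i, j) for (i, j) in upper if i < ell}
    # S2: the leading (ell - 1) x (ell - 1) block, plus the rest of the first m rows.
    s2 = {(i, j) for (i, j) in upper if j < ell - 1 or i < m}
    return s1, s2

def count_random(alpha, ell, m):
    s1, s2 = random_positions(alpha, ell, m)
    return len(s1) + len(s2)

# k = alpha + 1, and R = ell * alpha + (k - ell) * m when beta = 1.
for alpha, ell, m in [(2, 1, 0), (4, 2, 1), (4, 2, 2)]:
    k = alpha + 1
    assert count_random(alpha, ell, m) == ell * alpha + (k - ell) * m
```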
The following example illustrates this encoding procedure. 


Fig. 11: An MSR code for $(n = 7, k = 3, d = 4)$ providing $\{\ell = 1, m = 0\}$ security from passive
eavesdroppers: the data is secure from passive eavesdroppers who can read the data stored in any one arbitrary
node but not the data passed during repair of any node. The message is $\{a, b, c, d\}$ and the finite field of
operation is $\mathbb{F}_{13}$. The symbols $r_1$ and $r_2$ are drawn uniformly (and independently) at random from $\mathbb{F}_{13}$. The
data of any node $f$ ($1 \le f \le 7$), when required to protect from compromise of any $p = 1$ node, is obtained
by connecting to any $(d + 2p) = 6$ nodes and downloading the inner product of the two symbols in each node
with $[1 \;\; x_f]$, where $x_f$ is the $f$-th element of $x = [0 \; 1 \; 3 \; 2 \; 6 \; 5 \; 4]$. The code is optimal in the sense that the
amount of storage required is the minimum possible in this setting (it is MDS), and furthermore, the amount
of download required for any repair is also the minimum possible under this amount of storage.

Example 7: In this example, we will construct an MSR code for $[n = 7, k = 3, d = 4]$. Let the alphabet of
operation be $\mathbb{F}_{13}$. With $\beta = 1$, (24) leads to having $\alpha = 2$ and $B = 6$ in the absence of security requirements.
Let us suppose we wish to provide security from an eavesdropper who can read the data stored in any one
arbitrary node but cannot read any data downloaded during repair operations. This corresponds to parameters $\ell = 1$
and $m = 0$. The maximum size of the message that can be stored, from (40), is $B^* = 4$. We thus choose $B - B^* = 2$
symbols $r_1$ and $r_2$ uniformly at random from $\mathbb{F}_{13}$. Let us denote the 4 message symbols as $\{a, b, c, d\}$. In


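this code, we choose the matrices $\Psi^*$, $S_1^*$, $S_2^*$ and $M^*$ as

$$\Psi^* = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ 1 & 3 & 9 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 6 & 10 & 8 \\ 1 & 5 & 12 & 8 \\ 1 & 4 & 3 & 12 \end{bmatrix} , \quad
S_1^* = \begin{bmatrix} r_1 & r_2 \\ r_2 & a \end{bmatrix} , \quad
S_2^* = \begin{bmatrix} b & c \\ c & d \end{bmatrix} , \quad
M^* = \begin{bmatrix} S_1^* \\ S_2^* \end{bmatrix} .$$

The code is given by $C^* = \Psi^* M^*$. The data stored by the seven storage nodes under this code is depicted in
Fig. 11. One can verify that under this code, the data stored in any single node provides no information about
the message.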


B. Reconstruction, Repair and Security 

The following theorems prove the properties of reconstruction, repair and security in the secure PM-MSR 
code. 

Theorem 8 (Data-reconstruction and node-repair): In the code presented, a data-collector can recover all the $B^*$ message symbols by downloading the data stored in any $k$ nodes, and a failed node can be repaired by downloading $\beta = 1$ symbol each from any $d$ remaining nodes.

Proof: As in the proof of Theorem 3, treating the random symbols also as message symbols, the secure PM-MSR code $\mathcal{C}^*$ becomes identical to the PM-MSR code $\mathcal{C}$. Thus reconstruction and repair in $\mathcal{C}^*$ are identical to that in $\mathcal{C}$. ■

Theorem 9 (Information-theoretic security): The code $\mathcal{C}^*$ ensures $\{\ell, m\}$-security, i.e., an eavesdropper having access to the data stored in up to $\ell$ nodes, and to all the data passed to up to $m$ of these nodes during one or more of their repair operations, gets no information about the message.

Proof: The proof follows the three-step procedure as in the proof in the MBR case (Theorem 4). Please see the Appendix for the complete proof. ■


C. Shortening for $d > 2k - 2$

For any $i \geq 1$, we will now construct product-matrix MSR codes having the parameters $[n, k, d = 2k - 2 + i]$, for any desired level $\{\ell, m\}$ of security. Let us denote this code (which we will construct below) as $\mathcal{C}^*$. The code $\mathcal{C}^*$ will satisfy (40), and is optimal when $m \in \{0, 1\}$. Throughout, we will assume without loss of generality that the data is striped, and consider $\beta = 1$.


As the first step in this task, a secure product-matrix MSR code is constructed for the parameters $[n' = n + i, \ k' = k + i, \ d' = d + i = 2k - 2 + 2i]$, with a security level $\{\ell + i, m\}$. Note that this parameter set satisfies $d' = 2k' - 2$, and one can employ the encoding procedure described in Section VIII-A for this construction. Denote this code as $\mathcal{C}'^{*}$.


Our goal is to derive the desired code $\mathcal{C}^*$ by shortening the code $\mathcal{C}'^{*}$. The shortening procedure described in Section VII-E provides one candidate approach towards this goal. Under this procedure, the code $\mathcal{C}^*$ is derived from $\mathcal{C}'^{*}$ by setting the data in the first $i$ nodes to zero, and treating all operations in the resultant code $\mathcal{C}^*$ as operations in $\mathcal{C}'^{*}$. However, this procedure requires the code to be made systematic, which is not possible under a code that provides security, since the compromise of any of the systematic nodes would directly reveal information about the message. Thus, in this section, we will follow a slightly different route to achieve that goal.


Let matrices $\Psi'^{*}$ and $M'^{*}$ respectively denote the $(n' \times d')$ encoding matrix and the $(d' \times d')$ message matrix of code $\mathcal{C}'^{*}$. We will require the matrix $\Psi'^{*}$ to satisfy an additional property that when restricted to the first $i$ columns, any set of $i$ rows is linearly independent. The choice of $\Psi'^{*}$ as a Vandermonde matrix, as done in Section VII-A, satisfies this condition.

Now, for the specific parameters of codes $\mathcal{C}^*$ and $\mathcal{C}'^{*}$, observe from (40) that

$$B^* = B'^{*} = (k - \ell)(k - 1 + i - m) .$$

However, from (41), we see that the number of random symbols $R^*$ in code $\mathcal{C}^*$ is

$$R^* = \ell(k - 1 + i) + (k - \ell)m$$

and the number of random symbols $R'^{*}$ in $\mathcal{C}'^{*}$ is

$$R'^{*} = (\ell + i)(k - 1 + i) + (k - \ell)m = i(k - 1 + i) + R^* = i\alpha + R^* .$$

Here, $\alpha = (k - 1 + i) = (d - k + 1) = (d' - k' + 1)$ is the storage capacity per node in both codes $\mathcal{C}^*$ and $\mathcal{C}'^{*}$.
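The bookkeeping above is easy to check mechanically. The sketch below recomputes the symbol counts of the target code and of the parent code for $\beta = 1$; the general forms used for (40) and (41), namely $(k - \ell)(\alpha - m)$ message symbols and $\ell\alpha + (k - \ell)m$ random symbols, are inferred from the expressions in this subsection, and the function name is hypothetical.

```python
def shortening_counts(k, ell, m, i):
    """Symbol counts for the target code and its parent code, with beta = 1."""
    d = 2 * k - 2 + i
    alpha = d - k + 1                          # storage per node, equals k - 1 + i
    k_p, d_p, ell_p = k + i, d + i, ell + i    # parent-code parameters k', d', l + i
    alpha_p = d_p - k_p + 1                    # storage per node in the parent code
    return {
        "alpha": alpha,
        "alpha_parent": alpha_p,
        "d_parent": d_p,
        "k_parent": k_p,
        "message": (k - ell) * (alpha - m),                 # B* as in (40)
        "random": ell * alpha + (k - ell) * m,              # R* as in (41)
        "message_parent": (k_p - ell_p) * (alpha_p - m),    # B'* as in (40)
        "random_parent": ell_p * alpha_p + (k_p - ell_p) * m,  # R'* as in (41)
    }
```

For example, `shortening_counts(3, 1, 0, 1)` describes a code with $d = 5$ and $\alpha = 3$ whose parent has $k' = 4$ and $d' = 6$; the two codes carry the same number of message symbols, and the parent has exactly $i\alpha$ more random symbols.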

To obtain code $\mathcal{C}^*$ from code $\mathcal{C}'^{*}$, we replace a subset of $i\alpha$ random symbols from $M'^{*}$ by a set of deterministically chosen symbols in the following manner. Let $\Psi'^{*}_{i}$ denote the $(i \times d')$ submatrix of $\Psi'^{*}$ with its rows comprising the first $i$ rows of $\Psi'^{*}$. Consider the following $i\alpha$ entries in $M'^{*}$: (a) the $i\alpha - \binom{i}{2}$ symbols in the first $i$ rows (and columns) of the symmetric matrix $S_1'^{*}$, and (b) the $\binom{i}{2}$ symbols in the intersection of the first $(i - 1)$ rows and the first $(i - 1)$ columns of $S_2'^{*}$. In the construction of code $\mathcal{C}'^{*}$ as per the procedure of Section VIII-A, each of these entries is populated by a randomly chosen value. With the goal of setting the data in the first $i$ nodes to zero, we instead choose these $i\alpha$ elements in a deterministic manner: once all other values of $M'^{*}$ are fixed, we choose these values to satisfy

$$\Psi'^{*}_{i} M'^{*} = 0 .$$

This is feasible because, by construction, the first $i$ columns of $\Psi'^{*}$ are linearly independent. Let $M^*$ denote the $(d' \times d')$ matrix resulting from this choice.

The encoding matrix $\Psi^*$ of code $\mathcal{C}^*$ is chosen to be the $(n \times d')$ matrix comprising the last $n$ rows of $\Psi'^{*}$. Finally, the $[n, k, d = 2k - 2 + i]$ code $\mathcal{C}^*$ with security $\{\ell, m\}$ is given by $\mathcal{C}^* = \Psi^* M^*$.

Theorem 10 (Data-reconstruction and node-repair): In the code presented, a data-collector can recover all the message symbols by downloading the data stored in any $k$ nodes, and a failed node can be repaired by downloading $\beta = 1$ symbol each from any $d$ remaining nodes.

Proof: To see how reconstruction and repair are performed in code $\mathcal{C}^*$, we can pretend to operate under code $\mathcal{C}'^{*}$, and assume that a user (or replacement node) always connects to the first $i$ nodes in addition to the $k$ (or $d$) nodes that it chooses in $\mathcal{C}^*$. This is a valid assumption since the data in the first $i$ nodes of $\mathcal{C}'^{*}$ is known to be all zero. Thus, reconstruction or repair in $\mathcal{C}^*$ is identical to that in $\mathcal{C}'^{*}$, and is successful since $\mathcal{C}'^{*}$ is simply an $[n' = n + i, \ k' = k + i, \ d' = 2k' - 2 = d + i]$ product-matrix MSR code. ■

Theorem 11 (Information-theoretic security): In the code presented, an eavesdropper having access to the data stored in up to $\ell$ nodes, and to all the data passed to up to $m$ of these nodes during one or more of their repair operations, gets no information about the message.

Proof: Please see the Appendix. ■ 


IX. Necessary and Sufficient Conditions to Secure any Regenerating Code from Active Adversaries

The conversion of the product-matrix codes into codes providing on-demand security, as described in Section V and Section VII, raises a natural question as to whether any regenerating code can provide on-demand security. We answer this question by providing a necessary and sufficient condition for the same. To the best of our knowledge, of all codes known to date, the only codes that satisfy this condition are the product-matrix codes [11].

Theorem 12: An $[n, k, d]$ regenerating code satisfying (2) can provide on-demand security from active adversaries and satisfy (9) with equality if and only if it has a repair mechanism that obeys the following condition: during any instance of repair, the data passed by an existing node to the replacement node does not depend on the identities of the other $(d - 1)$ nodes helping in this repair.

Proof: Necessity: Consider an $[n, k, d]$ regenerating code, and assume for now that $(n - d)$ is an odd number. Consider repair of a failed node $f$. Since the code provides on-demand security, it must be able to offer security from the compromise of $p = \frac{n - d - 1}{2}$ nodes, by allowing the replacement node to download data from the remaining $(d + 2p) = (n - 1)$ nodes. First, observe that all the remaining $(n - 1)$ nodes help in the repair, and this removes the dependence on the answer to “which other nodes help in the repair”. Let $\sigma_1, \ldots, \sigma_{n-1}$ be the data downloaded from the $(n - 1)$ remaining nodes. Since arbitrary errors in any $p$ of these must be correctable, it follows that in the absence of any errors, any $d$-sized subset of $\{\sigma_1, \ldots, \sigma_{n-1}\}$ must suffice for correct decoding. Now consider the operation of this code in the absence of security requirements. In this setting, consider the repair scheme wherein a node helping in the repair of a failed node passes precisely what it passed in the case with security. The arguments above ensure that the repair will be successful, and the data $\sigma_j$ passed is independent of the identities of the other $(d - 1)$ nodes helping in this repair. If the parameters of the secure code satisfy (9) with equality, then those of the code without security will satisfy (2).

Recall the error detection requirement of on-demand security discussed in Remark 1. When $(n - d)$ is even, we invoke this requirement of detection of errors from the compromise of any $p = n - d - 1$ nodes, wherein the replacement node downloads data from the remaining $(d + p) = (n - 1)$ nodes. The rest of the argument is identical to the case of $(n - d)$ being odd considered above.


Sufficiency: Consider a regenerating code satisfying the property that in the absence of security requirements, for repair of any node $f$, the data passed by any helper node $h$ is independent of the identities of the remaining $(d - 1)$ helper nodes. For a desired security level $p$ during repair of node $f$, let the replacement node connect to some $(d + 2p)$ arbitrary nodes. Let each of these $(d + 2p)$ nodes pass the data it would have passed to the replacement node in the absence of security requirements. Since the data passed by any $d$ of these nodes would suffice to recover the desired contents in the absence of errors, it follows that the aggregate data from the $(d + 2p)$ nodes can correct $p$ arbitrary errors. As a result, this data allows for recovery of the desired data correctly even in the presence of up to $p$ compromised nodes. An identical argument holds for the reconstruction property, thus ensuring that the code provides on-demand security. If the parameters of the original regenerating code satisfy (2), then those of the secure code will satisfy (9) with equality. ■
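The sufficiency argument can be illustrated with a toy stand-in. The sketch below does not implement a regenerating code: it assumes the desired data is a polynomial of degree less than $d$ over $\mathbb{F}_{13}$ and that helper $j$ passes its evaluation at a point $x_j$, which is one simple setting where each helper's symbol is independent of who else helps and any $d$ symbols suffice. The brute-force decoder and all names are hypothetical.

```python
from itertools import combinations

P = 13  # assumed field size

def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients (lowest degree first) at x."""
    return sum(c * pow(x, e, P) for e, c in enumerate(coeffs)) % P

def interpolate(points, x):
    """Value at x of the unique polynomial of degree < len(points) through points."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for t, (xt, _) in enumerate(points):
            if t != j:
                num = num * (x - xt) % P
                den = den * (xj - xt) % P
        total = (total + yj * num * pow(den, -1, P)) % P
    return total

def robust_decode(responses, d, p):
    """Recover the error-free symbols from d + 2p responses, at most p of them corrupted.

    responses is a list of (x_j, y_j) pairs. A candidate built from any d
    responses is accepted once it agrees with at least d + p of them.
    """
    for subset in combinations(responses, d):
        agree = sum(1 for x, y in responses if interpolate(list(subset), x) == y)
        if agree >= d + p:
            return [interpolate(list(subset), x) for x, _ in responses]
    return None  # more than p responses were corrupted
```

Two distinct candidates cannot both be accepted: each would agree with at least $d + p$ of the $d + 2p$ responses, so they would share at least $d$ responses and hence coincide. This mirrors the step in the proof where "any $d$ suffice" is turned into correction of $p$ arbitrary errors.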

Corollary 13: A code satisfying conditions in Theorem 12 must necessarily operate in the parameter regime $d < (n - 1)$.

Proof: Clearly, since the number of nodes contacted during repair must satisfy $(d + 2p) \leq (n - 1)$, the requirement of supporting $p > 0$ requires that $n > (d + 1)$. It follows that the code should not restrict the number of nodes $n$ to be $(d + 1)$. ■


Remark 4: The only explicit regenerating codes that support $n > (d + 1)$ are the high-rate ‘approximately-exact’ MSR codes of [9] and the product-matrix codes [11]. However, the MSR codes of [9] do not satisfy the condition provided in Theorem 12, while as shown in Section V and Section VII, the product-matrix codes do satisfy this condition.


X. Discussion 

The many recent incidents of the compromise of storage systems (e.g., [29]-[31]) underscore the importance of securing the data in distributed storage systems. Such systems presently employ cryptographic techniques for providing security from passive eavesdroppers. Cryptographic techniques generally assume the eavesdropper to possess only bounded computational power. In practice, however, upon gaining access to the encrypted data, adversaries employ techniques such as intelligent guesswork, dictionary attacks or even crowd-sourcing, and often succeed in decoding significant parts of the data. Information-theoretic security, on the other hand, allows the adversary to possess unbounded computational power, but assumes that the adversary can gain access to only a limited amount of data. In the setting of distributed storage, information-theoretic security relies on the assumption that the adversary cannot gain access to data stored in more than a certain number of nodes. However, information-theoretic security typically necessitates much greater resource overheads; in our setting, these are overheads in storage and bandwidth.

The cryptographic and information-theoretic approaches for security rely on different kinds of assumptions, and the security in distributed storage systems may be enhanced by employing both approaches in tandem.
To see this, consider a system in which the data is first encoded using conventional cryptographic techniques, 
following which, an information-theoretically secure erasure code is employed for storing the data and handling 
node-failures. In such a setting, the adversary first needs to gain access to a certain minimum number of nodes 
(due to the information-theoretic security), without which it can neither obtain nor corrupt any data; even upon 
managing to gain such an access, the adversary obtains only the encrypted data. As a result, the layer of 
information-theoretic security raises the barrier to gain access to the encrypted data. Furthermore, the security 
provided by the cryptographic encoding allows for the use of only a thin layer of information-theoretic security, 
wherein the threshold on the number of nodes ($\ell$, $m$ or $p$ in our setting) can be as low as 2 or 3. This
aids in overcoming the barrier of the significant overheads faced by information-theoretic security.

An analogy from an end user’s perspective is that of storing data (securely) in cloud-based storage services. For instance, one may encode encrypted data using an information-theoretically secure code, and store fractions of it in the Dropbox, Google Drive and Microsoft SkyDrive storage services.² Such a code will ensure that a compromise of one of these services provides zero information to the adversary. To get any information about the data, the adversary will need to hack into at least two of the three services.
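The two-out-of-three behaviour in this analogy can be sketched with a small threshold scheme. The code below is only an illustration and is not one of the codes of this paper: it assumes a Shamir-style split of a single byte over the prime field $\mathbb{F}_{257}$, with one share per service, and the function names are hypothetical.

```python
import secrets

P = 257  # assumed prime field size, large enough to hold one byte

def split(secret_byte, r=None):
    """Split one byte into three shares, one per storage service (x = 1, 2, 3)."""
    if r is None:
        r = secrets.randbelow(P)   # fresh uniform random symbol
    # Share at point x is the value of the line secret + r * x over F_P.
    return {x: (secret_byte + r * x) % P for x in (1, 2, 3)}

def recover(x1, y1, x2, y2):
    """Recover the byte from any two shares by finding the line through them."""
    slope = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    return (y1 - slope * x1) % P
```

For any fixed secret, a single share takes every value of $\mathbb{F}_{257}$ exactly once as the random symbol ranges over the field, so one compromised service sees a uniformly distributed value; any two shares determine the line and hence the secret.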

In this paper, we addressed the information-theoretic part. We presented explicit codes for distributed storage that offer protection from active adversaries and passive eavesdroppers. For a large range of parameters, the codes achieve the outer bounds on the system requirements, thus also establishing the capacity of such systems.

Appendix

We now present proofs of security from passive eavesdroppers. We use $H(\cdot)$ to denote Shannon entropy and $I(\cdot \, ; \cdot)$ to denote mutual information. All logarithms are taken to base $q$ where $q$ is the size of the finite field of operation of the code under consideration.

Proof of Theorem 4 ($\{\ell, m\}$ security under the MBR code): We consider the worst case where the eavesdropper indeed gains access to the data stored in some $\ell$ nodes and the data passed during any of the repair operations of some $m$ of these $\ell$ nodes.

Consider the repair of some node $f$ under the product-matrix MBR code, and consider any node $i$ helping in the repair process. Let $\psi_f^*$ and $\psi_i^*$ denote the encoding vectors of nodes $f$ and $i$ respectively. The data passed by node $i$ under the repair algorithm is $(\psi_i^*)^T M^* \psi_f^*$. Since matrix $M^*$ is symmetric, this is equal to $(\psi_f^*)^T M^* \psi_i^*$.

Thus, an eavesdropper listening to the data downloaded for repair of node $f$ obtains no more than $(\psi_f^*)^T M^*$, which is the data stored in node $f$.
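The observation that repair traffic reveals nothing beyond the contents of the repaired node can be traced in a toy computation. The sketch below assumes a product-matrix MBR code with $k = 2$, $d = 3$ over $\mathbb{F}_{13}$, a Vandermonde encoding matrix, and an arbitrary symmetric message matrix; these parameter choices and all function names are illustrative assumptions.

```python
P = 13  # assumed field size

def psi(x):
    """Vandermonde encoding vector [1, x, x^2] (an assumed choice of Psi)."""
    return [1, x % P, x * x % P]

def stored(x, M):
    """Data psi^T M held by the node with evaluation point x."""
    v = psi(x)
    return [sum(v[r] * M[r][c] for r in range(3)) % P for c in range(3)]

def solve_mod(A, b):
    """Solve A z = b over F_P for a square invertible A (Gauss-Jordan elimination)."""
    n = len(A)
    aug = [[v % P for v in row] + [bi % P] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)
        aug[col] = [v * inv % P for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [(vr - factor * vc) % P for vr, vc in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]

def repair(f, helpers, M):
    """Rebuild node f from the single symbol psi_h^T M psi_f sent by each of d = 3 helpers."""
    target = psi(f)
    sent = [sum(s * t for s, t in zip(stored(h, M), target)) % P for h in helpers]
    # Solving Psi_helpers * (M psi_f) = sent gives M psi_f, which by symmetry of M
    # is the transpose of psi_f^T M, the data node f is supposed to store.
    return solve_mod([psi(h) for h in helpers], sent)
```

With, for instance, `M = [[1, 2, 4], [2, 3, 5], [4, 5, 0]]`, `repair(1, [2, 3, 4], M)` returns the same three symbols as `stored(1, M)`, and each transmitted symbol equals the inner product of `stored(1, M)` with the helper's encoding vector, so it is a function of the data of node 1 alone.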

² dropbox.com, drive.google.com, skydrive.live.com.

Let $\Psi^*_{\mathrm{eve}}$ be the $(\ell \times d)$ submatrix of $\Psi^*$, with its $\ell$ rows comprising the $\ell$ rows of $\Psi^*$ corresponding to the nodes to which the eavesdropper has gained access. Thus the eavesdropper has access to the $\ell d$ symbols in the $(\ell \times d)$ matrix $E^*$ defined as

$$E^* = \Psi^*_{\mathrm{eve}} M^* . \tag{42}$$

Let $\mathcal{E}$ denote the set of these $\ell d$ symbols that the eavesdropper has gained access to. Further, let $\mathcal{U}$ denote the set of all $B^*$ message symbols and let $\mathcal{R}$ denote the set of all $R = B - B^*$ random symbols introduced at the encoding stage. In the proof, with some abuse of notation, we will also use the three terms $\mathcal{E}$, $\mathcal{U}$ and $\mathcal{R}$ to denote the random variables corresponding to the respective sets.

As the first step of the proof, we will now show that given the message symbols as side information, an eavesdropper can decode all the random symbols. Under the temporary assumption that the eavesdropper has access to all message symbols, the linearity of the code as in (42) allows the eavesdropper to subtract the effect of all message symbols from $E^*$. Denote the resultant $(\ell \times d)$ matrix as $\tilde{E}^*$. Define $\tilde{M}^*$ as a $(d \times d)$ matrix obtained by setting all message symbols in $M^*$ to zero. Thus $\tilde{M}^*$ has its first $\ell$ rows and first $\ell$ columns identical to that of $M^*$, and zeros elsewhere. Then, given the message symbols, the eavesdropper now has access to

$$\tilde{E}^* = \Psi^*_{\mathrm{eve}} \tilde{M}^* . \tag{43}$$

Recall the property of $\Psi^*_{\mathrm{eve}}$ that any $\ell$ rows, when restricted to the first $\ell$ columns, are independent. Thus, recovering the $R$ random symbols from $\tilde{E}^*$ is identical to data reconstruction in a product-matrix MBR code designed for $[\tilde{n} = n, \ \tilde{k} = \ell, \ \tilde{d} = d]$ in the absence of security requirements. As a result, given the message symbols, the eavesdropper can decode all the random symbols. Hence we have

$$H(\mathcal{R} \mid \mathcal{E}, \mathcal{U}) = 0 . \tag{44}$$

We will now show that $H(\mathcal{E}) \leq R$. From the value of $R$ in (23), it suffices to show that out of the $\ell d$ symbols that the eavesdropper has access to, $\binom{\ell}{2}$ of them are functions (linear combinations) of the rest. To this end, consider the $(\ell \times \ell)$ matrix

$$E^* (\Psi^*_{\mathrm{eve}})^T = \Psi^*_{\mathrm{eve}} M^* (\Psi^*_{\mathrm{eve}})^T . \tag{45}$$

Since $M^*$ is symmetric, the $(\ell \times \ell)$ matrix in (45) is also symmetric. Thus $\binom{\ell}{2}$ dependencies among the elements of $E^*$ can be described by the $\binom{\ell}{2}$ upper-triangular elements of the expression

$$E^* (\Psi^*_{\mathrm{eve}})^T - \Psi^*_{\mathrm{eve}} (E^*)^T = 0 . \tag{46}$$

The linear independence of the rows of $\Psi^*_{\mathrm{eve}}$ implies that these $\binom{\ell}{2}$ (redundant) equations are linearly independent. Thus the eavesdropper has access to at most $\left(\ell d - \binom{\ell}{2}\right)$ independent symbols, i.e.,

$$H(\mathcal{E}) \leq R . \tag{47}$$

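The final part of the proof establishes that given (44) and (47), the eavesdropper obtains no information about the message.

\begin{align}
I(\mathcal{U}; \mathcal{E}) &= H(\mathcal{E}) - H(\mathcal{E} \mid \mathcal{U}) \tag{48} \\
&\leq R - H(\mathcal{E} \mid \mathcal{U}) \tag{49} \\
&= R - H(\mathcal{E} \mid \mathcal{U}) + H(\mathcal{E} \mid \mathcal{U}, \mathcal{R}) \tag{50} \\
&= R - I(\mathcal{E}; \mathcal{R} \mid \mathcal{U}) \tag{51} \\
&= R - \left( H(\mathcal{R} \mid \mathcal{U}) - H(\mathcal{R} \mid \mathcal{E}, \mathcal{U}) \right) \tag{52} \\
&= R - H(\mathcal{R} \mid \mathcal{U}) \tag{53} \\
&= R - H(\mathcal{R}) \tag{54} \\
&= R - R \tag{55} \\
&= 0 , \tag{56}
\end{align}

where (49) follows from (47); (50) is a result of the fact that every symbol in the system is a function of $\mathcal{U}$ and $\mathcal{R}$, resulting in $H(\mathcal{E} \mid \mathcal{U}, \mathcal{R}) = 0$; (53) follows from (44); and (54) follows from the fact that the random symbols are independent of the message symbols. ■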
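The three quantities used in this proof can be computed exhaustively for a very small instance. The sketch below assumes a toy product-matrix MBR code with $k = d = 2$ and $\ell = 1$ over $\mathbb{F}_5$, whose symmetric message matrix has random symbols $r_1, r_2$ in its first row and column and one message symbol $u$ in the remaining entry, and an eavesdropped node with encoding vector $[1 \ 2]$. These parameters and the function names are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import log

Q = 5   # assumed field size; entropies are measured in base Q
X = 2   # assumed evaluation point of the compromised node

def entropy(samples):
    """Shannon entropy (base Q) of the empirical distribution of equally likely samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum(c / total * log(c / total, Q) for c in counts.values())

def toy_mbr_entropies():
    """Enumerate all (u, r1, r2) and return I(U;E), H(E) and H(R|E,U)."""
    u_s, e_s, ue_s, ure_s = [], [], [], []
    for u, r1, r2 in product(range(Q), repeat=3):
        # Node stores [1, X] * [[r1, r2], [r2, u]] over F_Q.
        e = ((r1 + X * r2) % Q, (r2 + X * u) % Q)
        u_s.append(u)
        e_s.append(e)
        ue_s.append((u, e))
        ure_s.append((u, r1, r2, e))
    return {
        "I(U;E)": entropy(u_s) + entropy(e_s) - entropy(ue_s),
        "H(E)": entropy(e_s),
        "H(R|E,U)": entropy(ure_s) - entropy(ue_s),
    }
```

In this instance $R = \ell d - \binom{\ell}{2} = 2$, and the enumeration gives $H(\mathcal{E})$ equal to 2, $H(\mathcal{R} \mid \mathcal{E}, \mathcal{U})$ equal to 0 and $I(\mathcal{U}; \mathcal{E})$ equal to 0 up to floating-point error, matching (47), (44) and (56) respectively.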

Proof of Theorem 9 ($\{\ell, m\}$ security under the MSR code when $d = 2k - 2$): We consider the worst case where the eavesdropper indeed gains access to the data stored in some $\ell$ nodes and the data passed during any of the repair operations of some $m$ of these $\ell$ nodes. Let $\Psi^*_{\mathrm{eve}}$ be the $(\ell \times d)$ submatrix of $\Psi^*$, with its $\ell$ rows comprising the $\ell$ rows of $\Psi^*$ corresponding to the nodes to which the eavesdropper has gained access. Due to the specific structure of $\Psi^*$ as $\Psi^* = [\Phi^* \ \ \Lambda^* \Phi^*]$, one can write $\Psi^*_{\mathrm{eve}}$ as $\Psi^*_{\mathrm{eve}} = [\Phi^*_{\ell} \ \ \Lambda^*_{\ell} \Phi^*_{\ell}]$, where $\Phi^*_{\ell}$ and $\Lambda^*_{\ell}$ are the corresponding submatrices of $\Phi^*$ and $\Lambda^*$, and are sized $(\ell \times \alpha)$ and $(\ell \times \ell)$ respectively. Further, let $\Phi^*_{m}$ be the $(m \times \alpha)$ submatrix of $\Phi^*$ corresponding to the $m$ nodes whose repair operations are also eavesdropped upon. These $m$ nodes are a subset of the set of $\ell$ nodes that constitute the matrix $\Psi^*_{\mathrm{eve}}$, and hence $\Phi^*_{m}$ is a submatrix of $\Phi^*_{\ell}$.

Let $\mathcal{E}$ denote the set of symbols that the eavesdropper has gained access to. Further, let $\mathcal{U}$ denote the set of all $B^*$ message symbols and let $\mathcal{R}$ denote the set of all $R = B - B^*$ random symbols introduced at the encoding stage. In the proof, with some abuse of notation, we will also use the three terms $\mathcal{E}$, $\mathcal{U}$ and $\mathcal{R}$ to denote the random variables corresponding to the respective sets.


Under the repair algorithm of the product-matrix MSR code (see Theorem 8), $\mathcal{E}$ comprises the elements of the $(\ell \times \alpha)$ matrix $\Psi^*_{\mathrm{eve}} M^*$ and the elements of the $(d \times m)$ matrix $M^* (\Phi^*_{m})^T$.

Following the approach described in Section II-B, we will first show that given the message symbols as side information, an eavesdropper can decode all the random symbols. Next, using the properties of the matrix $\Psi^*_{\mathrm{eve}}$ and the specific structure of the message matrix $M^*$, we will show that $H(\mathcal{E}) \leq R$. At this point, the arguments in (48) to (56) establish that the eavesdropper obtains no information about the message.


Step 1: In this step, we show that $H(\mathcal{R} \mid \mathcal{E}, \mathcal{U}) = 0$. To this end, assume that a genie provides the values of all message symbols $\mathcal{U}$ to the eavesdropper. Since the code is linear, the eavesdropper can subtract the effects of the message symbols from each of the symbols in $\mathcal{E}$. This allows us to assume (for this step) the values of all message symbols to be zero. Define matrix

$$\tilde{M}^* = \begin{bmatrix} \tilde{S}_1^* \\ \tilde{S}_2^* \end{bmatrix}$$

as being identical to $M^*$, but with all message symbols set to zero. It suffices to show that under a code operating with $\tilde{M}^*$ as its message matrix, the eavesdropper can decode all the $R$ random symbols. The eavesdropper has access to the data $\Psi^*_{\mathrm{eve}} \tilde{M}^*$ and $\tilde{M}^* (\Phi^*_{m})^T$, and hence equivalently to:

$$E_1^* = \Phi^*_{m} \tilde{S}_1^* , \qquad E_2^* = \Phi^*_{m} \tilde{S}_2^* , \qquad E_3^* = \Phi^*_{\ell} \tilde{S}_1^* + \Lambda^*_{\ell} \Phi^*_{\ell} \tilde{S}_2^* ,$$

where $E_1^*$ and $E_2^*$ are both sized $(m \times (k - 1))$ and $E_3^*$ is sized $(\ell \times (k - 1))$.

Fig. 12: Illustration of the order of decoding of the different parts of the message matrix $M^*$ in Step 1 of the proof of security for MSR codes. Note that the parameters considered here satisfy $d = 2k - 2 = 2\alpha$. (Figure legend: random symbols; data symbols, assumed zero; $S_1^*$ and $S_2^*$ are symmetric.)

By construction, the intersection of the rightmost $(k - \ell)$ columns and the bottommost $(k - 1 - m)$ rows of the $((k - 1) \times (k - 1))$ matrix $\tilde{S}_2^*$ contains only the message symbols (depicted as (x) in Fig. 12), which are set to zero under our assumptions. Since the rows of $\Phi^*_{m}$ are linearly independent, one can decode the rightmost $(k - \ell)$ columns of $E_2^*$ to recover the symbols in the intersection of the top $m$ rows and rightmost $(k - \ell)$ columns of $\tilde{S}_2^*$ (depicted as (i) in Fig. 12). The symmetry of matrix $\tilde{S}_2^*$ implies the decoding of the intersection of the leftmost $m$ columns and bottommost $(k - \ell)$ rows of $\tilde{S}_2^*$ as well (depicted as (ii) in Fig. 12). We subtract these symbols out from the remaining data and update the values of $\{E_i^*, \ i = 1, 2, 3\}$ accordingly.

Now, consider the rightmost $(k - 1 - \ell)$ columns of $E_3^*$. The rightmost $(k - 1 - \ell)$ columns of $\tilde{S}_2^*$ are either zero or have been subtracted out, and as a result, the rightmost $(k - 1 - \ell)$ columns of $E_3^*$ are identical to the rightmost $(k - 1 - \ell)$ columns of $\Phi^*_{\ell} \tilde{S}_1^*$. Since the intersection of the bottom $(k - 1 - \ell)$ rows and the rightmost $(k - 1 - \ell)$ columns of $\tilde{S}_1^*$ (depicted as (ix) in Fig. 12) comprises only (zero-valued) message symbols, one can decode the rightmost $(k - 1 - \ell)$ columns of $\tilde{S}_1^*$ from $E_3^*$ (depicted as (iii) in Fig. 12). Since $\tilde{S}_1^*$ is symmetric, this also implies decoding of the bottommost $(k - 1 - \ell)$ rows of $\tilde{S}_1^*$ (depicted as (iv) in Fig. 12). Now assume that the values of these decoded symbols are subtracted out from $\{E_i^*, \ i = 1, 2, 3\}$.

Once again we see that the $\ell$th column of $\tilde{S}_2^*$ comprises symbols that are either zero-valued or have been subtracted out in the previous steps. As a result, the $\ell$th column of $E_3^*$ is identical to the $\ell$th column of $\Phi^*_{\ell} \tilde{S}_1^*$. Since the bottommost $(k - 1 - \ell)$ elements of the $\ell$th column of $\tilde{S}_1^*$ have been subtracted out, one can decode the $\ell$th column of $\tilde{S}_1^*$ from the $\ell$th column of $E_3^*$ (depicted as (v) in Fig. 12). Since $\tilde{S}_1^*$ is symmetric, this also implies decoding of the $\ell$th row of $\tilde{S}_1^*$ (depicted as (vi) in Fig. 12). Now assume that the values of these decoded symbols are also subtracted out from $\{E_i^*, \ i = 1, 2, 3\}$.

We continue to focus on $E_3^*$ and observe that the only non-zero elements remaining are in the first $(\ell - 1)$ rows and columns. Obtaining the top-left $((\ell - 1) \times (\ell - 1))$ submatrices of $\tilde{S}_1^*$ and $\tilde{S}_2^*$ (depicted as (vii) and (viii) in Fig. 12) from this data is identical to the reconstruction requirement of an MSR product-matrix code with parameters $k = \ell$, $d = 2\ell - 2$ and $\alpha = \ell - 1$ in the absence of secrecy requirements. We already know [11] that this is possible, and this completes the proof of this step.

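Step 2: We will now show that $H(\mathcal{E}) \leq R$. The eavesdropper obtains

$$E_1^* = \Phi^*_{m} S_1^* , \qquad E_2^* = \Phi^*_{m} S_2^* , \qquad E_3^* = \Phi^*_{\ell} S_1^* + \Lambda^*_{\ell} \Phi^*_{\ell} S_2^* ,$$

where $E_1^*$ and $E_2^*$ are both sized $(m \times (k - 1))$, and $E_3^*$ is sized $(\ell \times (k - 1))$. These three matrices together consist of

$$|\mathcal{E}| = 2m(k - 1) + \ell\alpha \tag{57}$$

symbols. Recall from (41) that $R = \ell\alpha + m(k - \ell)$. It thus suffices to show that at least

\begin{align}
|\mathcal{E}| - R &= \left( 2m(k - 1) + \ell\alpha \right) - \left( \ell\alpha + m(k - \ell) \right) \tag{58} \\
&= m(k + \ell - 2) \tag{59}
\end{align}

of the symbols in $\{E_1^*, E_2^*, E_3^*\}$ are redundant.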

To this end, first observe that the symmetry of matrices $S_1^*$ and $S_2^*$ makes the matrices $E_1^* (\Phi^*_{m})^T$ and $E_2^* (\Phi^*_{m})^T$ also symmetric. As a result, in each of the matrices $E_1^*$ and $E_2^*$, $\binom{m}{2}$ of the symbols are redundant. Now consider matrix $E_3^*$. The fact that $\Phi^*_{m}$ is a sub-matrix of $\Phi^*_{\ell}$ implies that the matrix

$$E_1^* + \Lambda^*_{m} E_2^* = \Phi^*_{m} S_1^* + \Lambda^*_{m} \Phi^*_{m} S_2^*$$

comprises $m$ rows of $E_3^*$. Thus the $m\alpha = m(k - 1)$ symbols in these $m$ rows of $E_3^*$ are redundant. Further, the relation

\begin{align}
E_3^* (\Phi^*_{m})^T &= \Phi^*_{\ell} (E_1^*)^T + \Lambda^*_{\ell} \Phi^*_{\ell} (E_2^*)^T \tag{60} \\
&= \Phi^*_{\ell} S_1^* (\Phi^*_{m})^T + \Lambda^*_{\ell} \Phi^*_{\ell} S_2^* (\Phi^*_{m})^T \tag{61}
\end{align}

implies that $m$ symbols in the span of every row of $E_3^*$ are redundant. Discounting the $m$ rows of $E_3^*$ that we previously showed to be redundant, we see that an additional $m(\ell - m)$ symbols of $E_3^*$ are redundant. Adding these quantities, we get that a total of

$$2 \binom{m}{2} + m(k - 1) + m(\ell - m) = m(k + \ell - 2)$$

symbols in the data $\{E_1^*, E_2^*, E_3^*\}$ obtained by the eavesdropper are redundant, thus meeting the required condition (59).
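The counting in Step 2 is a short identity that can be verified directly. The sketch below, with a hypothetical function name, recomputes both sides for $\alpha = k - 1$: the number of surplus symbols $|\mathcal{E}| - R$ from (57) and (41), and the number of redundant symbols identified in the argument.

```python
from math import comb

def step2_counts(k, ell, m):
    """Return (|E| - R, redundant symbols found) for the MSR case with alpha = k - 1."""
    alpha = k - 1
    observed = 2 * m * (k - 1) + ell * alpha    # |E| as in (57)
    rnd = ell * alpha + m * (k - ell)           # R as in (41)
    # Symmetry of E1 Phi_m^T and E2 Phi_m^T, the m rows of E3 obtainable from E1, E2,
    # and m redundant symbols in each of the remaining (ell - m) rows of E3.
    found = 2 * comb(m, 2) + m * (k - 1) + m * (ell - m)
    return observed - rnd, found
```

Both returned values equal $m(k + \ell - 2)$ for every admissible choice $0 \leq m \leq \ell < k$, which is the content of (59).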

Step 3: Having shown that $H(\mathcal{R} \mid \mathcal{E}, \mathcal{U}) = 0$ and $H(\mathcal{E}) \leq R$, the remainder of the proof is identical to Step 3 of the proof of the MBR case (Theorem 4). ■

Proof of Theorem 11 ($\{\ell, m\}$-security under the MSR code when $d > 2k - 2$): Consider an eavesdropper who gains access to the data stored in an arbitrary set of $\ell$ nodes, and in addition, to the data downloaded for repair by some $m$ of these $\ell$ nodes. We will now show that under code $\mathcal{C}^*$, this eavesdropper obtains no information pertaining to the message. To see this, imagine the existence of a set of $i$ additional nodes that store only zeros as their data. Then, one can assume without loss of generality that the eavesdropper also has access to the data stored in these $i$ nodes. Now, the resultant system is identical to one employing code $\mathcal{C}'^{*}$, with the eavesdropper having access to the data stored in a subset of $(\ell + i)$ nodes, and the data downloaded for repair by $m$ of these $(\ell + i)$ nodes. As shown in Theorem 9, in this case, the eavesdropper obtains no information about the message in the system employing code $\mathcal{C}'^{*}$. Finally, since the two codes $\mathcal{C}'^{*}$ and $\mathcal{C}^*$ operate on the same message, it follows that the eavesdropper gains no information about the message in code $\mathcal{C}^*$. ■